Unverified Commit 1551e2dc authored by NielsRogge's avatar NielsRogge Committed by GitHub
Browse files

[WIP] Tapas v4 (tres) (#9117)



* First commit: adding all files from tapas_v3

* Fix multiple bugs including soft dependency and new structure of the library

* Improve testing by adding torch_device to inputs and adding dependency on scatter

* Use Python 3 inheritance rather than Python 2

* First draft model cards of base sized models

* Remove model cards as they are already on the hub

* Fix multiple bugs with integration tests

* All model integration tests pass

* Remove print statement

* Add test for convert_logits_to_predictions method of TapasTokenizer

* Incorporate suggestions by Google authors

* Fix remaining tests

* Change position embeddings sizes to 512 instead of 1024

* Comment out positional embedding sizes

* Update PRETRAINED_VOCAB_FILES_MAP and PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

* Added more model names

* Fix truncation when no max length is specified

* Disable torchscript test

* Make style & make quality

* Quality

* Address CI needs

* Test the Masked LM model

* Fix the masked LM model

* Truncate when overflowing

* More much needed docs improvements

* Fix some URLs

* Some more docs improvements

* Test PyTorch scatter

* Set to slow + minify

* Calm flake8 down

* First commit: adding all files from tapas_v3

* Fix multiple bugs including soft dependency and new structure of the library

* Improve testing by adding torch_device to inputs and adding dependency on scatter

* Use Python 3 inheritance rather than Python 2

* First draft model cards of base sized models

* Remove model cards as they are already on the hub

* Fix multiple bugs with integration tests

* All model integration tests pass

* Remove print statement

* Add test for convert_logits_to_predictions method of TapasTokenizer

* Incorporate suggestions by Google authors

* Fix remaining tests

* Change position embeddings sizes to 512 instead of 1024

* Comment out positional embedding sizes

* Update PRETRAINED_VOCAB_FILES_MAP and PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

* Added more model names

* Fix truncation when no max length is specified

* Disable torchscript test

* Make style & make quality

* Quality

* Address CI needs

* Test the Masked LM model

* Fix the masked LM model

* Truncate when overflowing

* More much needed docs improvements

* Fix some URLs

* Some more docs improvements

* Add add_pooling_layer argument to TapasModel

Fix comments by @sgugger and @patrickvonplaten

* Fix issue in docs + fix style and quality

* Clean up conversion script and add task parameter to TapasConfig

* Revert the task parameter of TapasConfig

Some minor fixes

* Improve conversion script and add test for absolute position embeddings

* Improve conversion script and add test for absolute position embeddings

* Fix bug with reset_position_index_per_cell arg of the conversion cli

* Add notebooks to the examples directory and fix style and quality

* Apply suggestions from code review

* Move from `nielsr/` to `google/` namespace

* Apply Sylvain's comments
Co-authored-by: default avatarsgugger <sylvain.gugger@gmail.com>
Co-authored-by: default avatarRogge Niels <niels.rogge@howest.be>
Co-authored-by: default avatarLysandreJik <lysandre.debut@reseau.eseo.fr>
Co-authored-by: default avatarLysandre Debut <lysandre@huggingface.co>
Co-authored-by: default avatarsgugger <sylvain.gugger@gmail.com>
parent ad895af9
......@@ -79,6 +79,7 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,tf-cpu,torch,testing,sentencepiece]
- run: pip install tapas torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cpu.html
- save_cache:
key: v0.4-{{ checksum "setup.py" }}
paths:
......@@ -105,6 +106,7 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing,sentencepiece]
- run: pip install tapas torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cpu.html
- save_cache:
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
......@@ -183,6 +185,7 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,testing,sentencepiece]
- run: pip install tapas torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cpu.html
- save_cache:
key: v0.4-torch-{{ checksum "setup.py" }}
paths:
......
......@@ -50,6 +50,7 @@ jobs:
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
pip install pandas torch-scatter -f https://pytorch-geometric.com/whl/torch-$(python -c "import torch; print(''.join(torch.__version__)")+$(python -c "import torch; print(''.join(torch.version.cuda.split('.')))").html
- name: Are GPUs recognized by our DL frameworks
run: |
......@@ -187,6 +188,7 @@ jobs:
pip install --upgrade pip
pip install .[torch,sklearn,testing,onnxruntime,sentencepiece]
pip install git+https://github.com/huggingface/datasets
pip install pandas torch-scatter -f https://pytorch-geometric.com/whl/torch-$(python -c "import torch; print(''.join(torch.__version__)")+$(python -c "import torch; print(''.join(torch.version.cuda.split('.')))").html
- name: Are GPUs recognized by our DL frameworks
run: |
......
......@@ -222,6 +222,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[TAPAS](https://huggingface.co/transformers/master/model_doc/tapas.html)** released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
......
......@@ -176,19 +176,22 @@ and conversion utilities for the following models:
30. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
31. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
31. `TAPAS <https://huggingface.co/transformers/master/model_doc/tapas.html>`__ released with the paper `TAPAS: Weakly
Supervised Table Parsing via Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof
Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
32. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
32. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
33. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
33. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
34. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
34. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
35. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
35. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive
36. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
......@@ -269,6 +272,8 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| T5 | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| TAPAS | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
......@@ -382,6 +387,7 @@ TensorFlow and/or Flax.
model_doc/roberta
model_doc/squeezebert
model_doc/t5
model_doc/tapas
model_doc/transformerxl
model_doc/xlm
model_doc/xlmprophetnet
......
TAPAS
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training
<https://www.aclweb.org/anthology/2020.acl-main.398>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for
answering questions about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types
that encode tabular structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset
comprising millions of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads
on top: a cell selection head and an aggregation head, for (optionally) performing aggregations (such as counting or
summing) among selected cells. TAPAS has been fine-tuned on several datasets: `SQA
<https://www.microsoft.com/en-us/download/details.aspx?id=54253>`__ (Sequential Question Answering by Microsoft), `WTQ
<https://github.com/ppasupat/WikiTableQuestions>`__ (Wiki Table Questions by Stanford University) and `WikiSQL
<https://github.com/salesforce/WikiSQL>`__ (by Salesforce). It achieves state-of-the-art on both SQA and WTQ, while
having comparable performance to SOTA on WikiSQL, with a much simpler architecture.
The abstract from the paper is the following:
*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the
collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations
instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition,
the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we
present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak
supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation
operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective
joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with
three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by
improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL
and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our
setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
In addition, the authors have further pre-trained TAPAS to recognize **table entailment**, by creating a balanced
dataset of millions of automatically created training examples which are learned in an intermediate step prior to
fine-tuning. The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first
pre-trained on MLM, and then on another dataset). They found that intermediate pre-training further improves
performance on SQA, achieving a new state-of-the-art as well as state-of-the-art on `TabFact
<https://github.com/wenhuchen/Table-Fact-Checking>`__, a large-scale dataset with 16k Wikipedia tables for table
entailment (a binary classification task). For more details, see their follow-up paper: `Understanding tables with
intermediate pre-training <https://www.aclweb.org/anthology/2020.findings-emnlp.27/>`__ by Julian Martin Eisenschlos,
Syrine Krichene and Thomas Müller.
The original code can be found `here <https://github.com/google-research/tapas>`__.
Tips:
- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell
of the table). Note that this is something that was added after the publication of the original TAPAS paper.
According to the authors, this usually results in a slightly better performance, and allows you to encode longer
sequences without running out of embeddings. This is reflected in the ``reset_position_index_per_cell`` parameter of
:class:`~transformers.TapasConfig`, which is set to ``True`` by default. The default versions of the models available
in the `model hub <https://huggingface.co/models?search=tapas>`_ all use relative position embeddings. You can still
use the ones with absolute position embeddings by passing in an additional argument ``revision="no_reset"`` when
calling the ``.from_pretrained()`` method. Note that it's usually advised to pad the inputs on the right rather than
the left.
- TAPAS is based on BERT, so ``TAPAS-base`` for example corresponds to a ``BERT-base`` architecture. Of course,
TAPAS-large will result in the best performance (the results reported in the paper are from TAPAS-large). Results of
the various sized models are shown on the `original Github repository <https://github.com/google-research/tapas>`_.
- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a
conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the
previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that
case, you have to feed every table-question pair one by one to the model, such that the `prev_labels` token type ids
can be overwritten by the predicted `labels` of the model to the previous question. See "Usage" section for more
info.
- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
Usage: fine-tuning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we explain how you can fine-tune :class:`~transformers.TapasForQuestionAnswering` on your own dataset.
**STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment**
Basically, there are 3 different ways in which one can fine-tune :class:`~transformers.TapasForQuestionAnswering`,
corresponding to the different datasets on which Tapas was fine-tuned:
1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example
if you first ask "what's the name of the first actor?" then you can ask a follow-up question such as "how old is
he?". Here, questions do not involve any aggregation (all questions are cell selection questions).
2. WTQ: if you're not interested in asking questions in a conversational set-up, but rather just asking questions
related to a table, which might involve aggregation, such as counting a number of rows, summing up cell values or
averaging cell values. You can then for example ask "what's the total number of goals Cristiano Ronaldo made in his
career?". This case is also called **weak supervision**, since the model itself must learn the appropriate
aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.
3. WikiSQL-supervised: this dataset is based on WikiSQL with the model being given the ground truth aggregation
operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation
operator is much easier.
To summarize:
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| **Task** | **Example dataset** | **Description** |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Conversational | SQA | Conversational, only cell selection questions |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
+------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
Initializing a model with a pre-trained base and randomly initialized classification heads from the model hub can be
done as follows (be sure to have installed the `torch-scatter dependency <https://github.com/rusty1s/pytorch_scatter>`_
for your environment):
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base')
>>> # or, the base sized model with WTQ configuration
>>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
>>> # or, the base sized model with WikiSQL configuration
>>> config = TapasConfig('google-base-finetuned-wikisql-supervised')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also
experiment by defining any hyperparameters you want when initializing :class:`~transformers.TapasConfig`, and then
create a :class:`~transformers.TapasForQuestionAnswering` based on that configuration. For example, if you have a
dataset that has both conversational questions and questions that might involve aggregation, then you can do it this
way. Here's an example:
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True, select_one_column=False)
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned
checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See `here
<https://github.com/google-research/tapas/issues/91#issuecomment-735719340>`__ for more info.
For a list of all pre-trained and fine-tuned TAPAS checkpoints available in the HuggingFace model hub, see `here
<https://huggingface.co/models?search=tapas>`__.
**STEP 2: Prepare your data in the SQA format**
Second, no matter what you picked above, you should prepare your dataset in the `SQA format
<https://www.microsoft.com/en-us/download/details.aspx?id=54253>`__. This format is a TSV/CSV file with the following
columns:
- ``id``: optional, id of the table-question pair, for bookkeeping purposes.
- ``annotator``: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
- ``position``: integer indicating if the question is the first, second, third,... related to the table. Only required
in case of conversational setup (SQA). You don't need this column in case you're going for WTQ/WikiSQL-supervised.
- ``question``: string
- ``table_file``: string, name of a csv file containing the tabular data
- ``answer_coordinates``: list of one or more tuples (each tuple being a cell coordinate, i.e. row, column pair that is
part of the answer)
- ``answer_text``: list of one or more strings (each string being a cell value that is part of the answer)
- ``aggregation_label``: index of the aggregation operator. Only required in case of strong supervision for aggregation
(the WikiSQL-supervised case)
- ``float_answer``: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of
weak supervision for aggregation (such as WTQ and WikiSQL)
The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the
TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the
SQA format. The author explains this `here
<https://github.com/google-research/tapas/issues/50#issuecomment-705465960>`__. Interestingly, these conversion scripts
are not perfect (the ``answer_coordinates`` and ``float_answer`` fields are populated based on the ``answer_text``),
meaning that WTQ and WikiSQL results could actually be improved.
**STEP 3: Convert your data into PyTorch tensors using TapasTokenizer**
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular
data), you can then use :class:`~transformers.TapasTokenizer` to convert table-question pairs into :obj:`input_ids`,
:obj:`attention_mask`, :obj:`token_type_ids` and so on. Again, based on which of the three cases you picked above,
:class:`~transformers.TapasForQuestionAnswering` requires different inputs to be fine-tuned:
+------------------------------------+----------------------------------------------------------------------------------------------+
| **Task** | **Required inputs** |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Conversational | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Weak supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``labels``, ``numeric_values``, |
| | ``numeric_values_scale``, ``float_answer`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
| Strong supervision for aggregation | ``input ids``, ``attention mask``, ``token type ids``, ``labels``, ``aggregation_labels`` |
+------------------------------------+----------------------------------------------------------------------------------------------+
:class:`~transformers.TapasTokenizer` creates the ``labels``, ``numeric_values`` and ``numeric_values_scale`` based on
the ``answer_coordinates`` and ``answer_text`` columns of the TSV file. The ``float_answer`` and ``aggregation_labels``
are already in the TSV file of step 2. Here's an example:
.. code-block::
>>> from transformers import TapasTokenizer
>>> import pandas as pd
>>> model_name = 'google/tapas-base'
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
>>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
>>> inputs
{'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
'numeric_values': tensor([[ ... ]]), 'numeric_values_scale: tensor([[ ... ]]), labels: tensor([[ ... ]])}
Note that :class:`~transformers.TapasTokenizer` expects the data of the table to be **text-only**. You can use
``.astype(str)`` on a dataframe to turn it into text-only data. Of course, this only shows how to encode a single
training example. It is advised to create a PyTorch dataset and a corresponding dataloader:
.. code-block::
>>> import torch
>>> import pandas as pd
>>> tsv_path = "your_path_to_the_tsv_file"
>>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
>>> class TableDataset(torch.utils.data.Dataset):
... def __init__(self, data, tokenizer):
... self.data = data
... self.tokenizer = tokenizer
...
... def __getitem__(self, idx):
... item = data.iloc[idx]
... table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
... encoding = self.tokenizer(table=table,
... queries=item.question,
... answer_coordinates=item.answer_coordinates,
... answer_text=item.answer_text,
... truncation=True,
... padding="max_length",
... return_tensors="pt"
... )
... # remove the batch dimension which the tokenizer adds by default
... encoding = {key: val.squeeze(0) for key, val in encoding.items()}
... # add the float_answer which is also required (weak supervision for aggregation case)
... encoding["float_answer"] = torch.tensor(item.float_answer)
... return encoding
...
... def __len__(self):
... return len(self.data)
>>> data = pd.read_csv(tsv_path, sep='\t')
>>> train_dataset = TableDataset(data, tokenizer)
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not
conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group
together the ``queries``, ``answer_coordinates`` and ``answer_text`` per table (in the order of their ``position``
index) and batch encode each table with its questions. This will make sure that the ``prev_labels`` token types (see
docs of :class:`~transformers.TapasTokenizer`) are set correctly. See `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__
for more info.
**STEP 4: Train (fine-tune) TapasForQuestionAnswering**
You can then fine-tune :class:`~transformers.TapasForQuestionAnswering` using native PyTorch as follows (shown here for
the weak supervision for aggregation case):
.. code-block::
>>> from transformers import TapasConfig, TapasForQuestionAnswering, AdamW
>>> # this is the default WTQ configuration
>>> config = TapasConfig(
... num_aggregation_labels = 4,
... use_answer_as_supervision = True,
... answer_loss_cutoff = 0.664694,
... cell_selection_preference = 0.207951,
... huber_loss_delta = 0.121194,
... init_cell_selection_weights_to_zero = True,
... select_one_column = True,
... allow_empty_column_selection = False,
... temperature = 0.0352513,
... )
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
>>> optimizer = AdamW(model.parameters(), lr=5e-5)
>>> for epoch in range(2): # loop over the dataset multiple times
... for idx, batch in enumerate(train_dataloader):
... # get the inputs;
... input_ids = batch["input_ids"]
... attention_mask = batch["attention_mask"]
... token_type_ids = batch["token_type_ids"]
... labels = batch["labels"]
... numeric_values = batch["numeric_values"]
... numeric_values_scale = batch["numeric_values_scale"]
... float_answer = batch["float_answer"]
... # zero the parameter gradients
... optimizer.zero_grad()
... # forward + backward + optimize
... outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
... labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
... float_answer=float_answer)
... loss = outputs.loss
... loss.backward()
... optimizer.step()
Usage: inference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here we explain how you can use :class:`~transformers.TapasForQuestionAnswering` for inference (i.e. making predictions
on new data). For inference, only ``input_ids``, ``attention_mask`` and ``token_type_ids`` (which you can obtain using
:class:`~transformers.TapasTokenizer`) have to be provided to the model to obtain the logits. Next, you can use the
handy ``convert_logits_to_predictions`` method of :class:`~transformers.TapasTokenizer` to convert these into predicted
coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a
non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example
of that:
.. code-block::
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd
>>> model_name = 'google/tapas-base-finetuned-wtq'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
... inputs,
... outputs.logits.detach(),
... outputs.logits_aggregation.detach()
...)
>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]
>>> answers = []
>>> for coordinates in predicted_answer_coordinates:
... if len(coordinates) == 1:
... # only a single cell:
... answers.append(table.iat[coordinates[0]])
... else:
... # multiple cells
... cell_values = []
... for coordinate in coordinates:
... cell_values.append(table.iat[coordinate])
... answers.append(", ".join(cell_values))
>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
... print(query)
... if predicted_agg == "NONE":
... print("Predicted answer: " + answer)
... else:
... print("Predicted answer: " + predicted_agg + " > " + answer)
What is the name of the first actor?
Predicted answer: Brad Pitt
How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such
that the ``prev_labels`` token types can be overwritten by the predicted ``labels`` of the previous table-question
pair. Again, more info can be found in `this notebook
<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb>`__.
Tapas specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.tapas.modeling_tapas.TableQuestionAnsweringOutput
:members:
TapasConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasConfig
:members:
TapasTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasTokenizer
:members: __call__, convert_logits_to_predictions, save_vocabulary
TapasModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasModel
:members: forward
TapasForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForMaskedLM
:members: forward
TapasForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForSequenceClassification
:members: forward
TapasForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TapasForQuestionAnswering
:members: forward
---
language: en
tags:
- tapas
- masked-lm
license: apache-2.0
---
# TAPAS base model
This model corresponds to the `tapas_inter_masklm_base_reset` checkpoint of the [original Github repository](https://github.com/google-research/tapas).
Disclaimer: The team releasing TAPAS did not write a model card for this model so this model card has been written by
the Hugging Face team and contributors.
## Model description
TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion.
This means it was pretrained on the raw tables and associated texts only, with no humans labelling them in any way (which is why it
can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:
- Masked language modeling (MLM): taking a (flattened) table and associated context, the model randomly masks 15% of the words in
the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words.
This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other,
or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional
representation of a table and associated text.
- Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model by creating
a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence
is supported or refuted by the contents of a table. The training examples are created based on synthetic as well as counterfactual statements.
This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used
to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed
or refuted by the contents of a table. Fine-tuning is done by adding classification heads on top of the pre-trained model, and then jointly
train the randomly initialized classification heads with the base model on a labelled dataset.
## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=tapas) to look for fine-tuned versions on a task that interests you.
Here is how to use this model to get the features of a given table-text pair in PyTorch:
```python
from transformers import TapasTokenizer, TapasModel
import pandas as pd
tokenizer = TapasTokenizer.from_pretrained('tapase-base')
model = TapasModel.from_pretrained("tapas-base")
data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
'Age': ["56", "45", "59"],
'Number of movies': ["87", "53", "69"]
}
table = pd.DataFrame.from_dict(data)
queries = ["How many movies has George Clooney played in?"]
text = "Replace me by any text you'd like."
encoded_input = tokenizer(table=table, queries=queries, return_tensors='pt')
output = model(**encoded_input)
```
## Training data
For masked language modeling (MLM), a collection of 6.2 million tables was extracted from English Wikipedia: 3.3M of class [Infobox](https://en.wikipedia.org/wiki/Help:Infobox)
and 2.9M of class WikiTable. The author only considered tables with at most 500 cells. As a proxy for questions that appear in the
downstream tasks, the authros extracted the table caption, article title, article description, segment title and text of the segment
the table occurs in as relevant text snippets. In this way, 21.3M snippets were created. For more info, see the original [TAPAS paper](https://www.aclweb.org/anthology/2020.acl-main.398.pdf).
For intermediate pre-training, 2 tasks are introduced: one based on synthetic and the other from counterfactual statements. The first one
generates a sentence by sampling from a set of logical expressions that filter, combine and compare the information on the table, which is
required in table entailment (e.g., knowing that Gerald Ford is taller than the average president requires summing
all presidents and dividing by the number of presidents). The second one corrupts sentences about tables appearing on Wikipedia by swapping
entities for plausible alternatives. Examples of the two tasks can be seen in Figure 1. The procedure is described in detail in section 3 of
the [TAPAS follow-up paper](https://www.aclweb.org/anthology/2020.findings-emnlp.27.pdf).
## Training procedure
### Preprocessing
The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
then of the form:
```
[CLS] Context [SEP] Flattened table [SEP]
```
The details of the masking procedure for each sequence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.
The details of the creation of the synthetic and counterfactual examples can be found in the [follow-up paper](https://arxiv.org/abs/2010.00571).
### Pretraining
The model was trained on 32 Cloud TPU v3 cores for one million steps with maximum sequence length 512 and batch size of 512.
In this setup, pre-training takes around 3 days. The optimizer used is Adam with a learning rate of 5e-5, and a warmup ratio
of 0.10.
### BibTeX entry and citation info
```bibtex
@misc{herzig2020tapas,
title={TAPAS: Weakly Supervised Table Parsing via Pre-training},
author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
year={2020},
eprint={2004.02349},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
```
```bibtex
@misc{eisenschlos2020understanding,
title={Understanding tables with intermediate pre-training},
author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
year={2020},
eprint={2010.00571},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
\ No newline at end of file
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# 🤗 Transformers Notebooks
You can find here a list of the official notebooks provided by Hugging Face.
Also, we would like to list here interesting content created by the community.
If you wrote some notebook(s) leveraging 🤗 Transformers and would like be listed here, please open a
Pull Request so it can be included under the Community notebooks.
## Hugging Face's notebooks 🤗
| Notebook | Description | |
|:----------|:-------------|------:|
| [Getting Started Tokenizers](https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
| [How to generate text](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)|
| [How to export model to ONNX](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) | Highlight how to export and run inference workloads through ONNX |
| [How to use Benchmarks](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) | How to benchmark models with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb)|
| [Reformer](https://github.com/huggingface/blog/blob/master/notebooks/03_reformer.ipynb) | How Reformer pushes the limits of language modeling | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/blog/blob/master/notebooks/03_reformer.ipynb)|
## Community notebooks:
| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|
| [Train T5 in Tensorflow 2 ](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
| [Long Sequence Modeling with Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | How to train on sequences as long as 500,000 tokens with Reformer | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) |
| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) |
| [Fine-tune a pre-trained Transformer on anyone's tweets](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | How to generate tweets in the style of your favorite Twitter account by fine-tune a GPT-2 model | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) |
| [A Step by Step Guide to Tracking Hugging Face Model Performance](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) | A quick tutorial for training NLP models with HuggingFace and & visualizing their performance with Weights & Biases | [Jack Morris](https://github.com/jxmorris12) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) |
| [Pretrain Longformer](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | How to build a "long" version of existing pretrained models | [Iz Beltagy](https://beltagy.net) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) |
| [Fine-tune Longformer for QA](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | How to fine-tune longformer model for QA task | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) |
| [Evaluate Model with 🤗nlp](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | How to evaluate longformer on TriviaQA with `nlp` | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) |
| [Fine-tune T5 for Sentiment Span Extraction](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | How to fine-tune T5 for sentiment span extraction using a text-to-text format with PyTorch Lightning | [Lorenzo Ampil](https://github.com/enzoampil) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) |
| [Fine-tune DistilBert for Multiclass Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | How to fine-tune DistilBert for multiclass classification with PyTorch | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)|
|[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
|[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
|[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|
|[Expand and Fine Tune Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)|
|[Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)|
|[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)|
|[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)|
|[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)|
|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune an Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)|
|[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
|[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# 🤗 Transformers Notebooks
You can find here a list of the official notebooks provided by Hugging Face.
Also, we would like to list here interesting content created by the community.
If you wrote some notebook(s) leveraging 🤗 Transformers and would like be listed here, please open a
Pull Request so it can be included under the Community notebooks.
## Hugging Face's notebooks 🤗
| Notebook | Description | |
|:----------|:-------------|------:|
| [Getting Started Tokenizers](https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
| [How to generate text](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)|
| [How to export model to ONNX](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) | Highlight how to export and run inference workloads through ONNX |
| [How to use Benchmarks](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) | How to benchmark models with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb)|
| [Reformer](https://github.com/huggingface/blog/blob/master/notebooks/03_reformer.ipynb) | How Reformer pushes the limits of language modeling | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/blog/blob/master/notebooks/03_reformer.ipynb)|
## Community notebooks:
| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|
| [Train T5 in Tensorflow 2 ](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
| [Long Sequence Modeling with Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | How to train on sequences as long as 500,000 tokens with Reformer | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) |
| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) |
| [Fine-tune a pre-trained Transformer on anyone's tweets](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | How to generate tweets in the style of your favorite Twitter account by fine-tune a GPT-2 model | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) |
| [A Step by Step Guide to Tracking Hugging Face Model Performance](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) | A quick tutorial for training NLP models with HuggingFace and & visualizing their performance with Weights & Biases | [Jack Morris](https://github.com/jxmorris12) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) |
| [Pretrain Longformer](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | How to build a "long" version of existing pretrained models | [Iz Beltagy](https://beltagy.net) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) |
| [Fine-tune Longformer for QA](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | How to fine-tune longformer model for QA task | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) |
| [Evaluate Model with 🤗nlp](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | How to evaluate longformer on TriviaQA with `nlp` | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) |
| [Fine-tune T5 for Sentiment Span Extraction](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | How to fine-tune T5 for sentiment span extraction using a text-to-text format with PyTorch Lightning | [Lorenzo Ampil](https://github.com/enzoampil) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) |
| [Fine-tune DistilBert for Multiclass Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | How to fine-tune DistilBert for multiclass classification with PyTorch | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)|
|[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
|[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
|[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|
|[Expand and Fine Tune Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)|
|[Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)|
|[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)|
|[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)|
|[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)|
|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune an Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)|
|[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
|[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
|[Fine-tuning TAPAS on Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | How to fine-tune *TapasForQuestionAnswering* with a *tapas-base* checkpoint on the Sequential Question Answering (SQA) dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Fine_tuning_TapasForQuestionAnswering_on_SQAipynb)|
|[Evaluating TAPAS on Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | How to evaluate a fine-tuned *TapasForSequenceClassification* with a *tapas-base-finetuned-tabfact* checkpoint using a combination of the 🤗 datasets and 🤗 transformers libraries | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)|
......@@ -164,6 +164,7 @@ from .models.retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBert
from .models.roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig, RobertaTokenizer
from .models.squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig, SqueezeBertTokenizer
from .models.t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from .models.tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig, TapasTokenizer
from .models.transfo_xl import (
TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
TransfoXLConfig,
......@@ -605,6 +606,13 @@ if is_torch_available():
T5PreTrainedModel,
load_tf_weights_in_t5,
)
from .models.tapas import (
TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST,
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
TapasModel,
)
from .models.transfo_xl import (
TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST,
AdaptiveEmbedding,
......
......@@ -216,6 +216,29 @@ except ImportError:
_tokenizers_available = False
try:
import pandas # noqa: F401
_pandas_available = True
except ImportError:
_pandas_available = False
try:
import torch_scatter
# Check we're not importing a "torch_scatter" directory somewhere
_scatter_available = hasattr(torch_scatter, "__version__") and hasattr(torch_scatter, "scatter")
if _scatter_available:
logger.debug(f"Succesfully imported torch-scatter version {torch_scatter.__version__}")
else:
logger.debug("Imported a torch_scatter object but this doesn't seem to be the torch-scatter library.")
except ImportError:
_scatter_available = False
old_default_cache_path = os.path.join(torch_cache_home, "transformers")
# New default cache, shared with the Datasets library
hf_cache_home = os.path.expanduser(
......@@ -325,6 +348,14 @@ def is_in_notebook():
return _in_notebook
def is_scatter_available():
return _scatter_available
def is_pandas_available():
return _pandas_available
def torch_only_method(fn):
def wrapper(*args, **kwargs):
if not _torch_available:
......@@ -427,6 +458,13 @@ installation page: https://github.com/google/flax and follow the ones that match
"""
# docstyle-ignore
SCATTER_IMPORT_ERROR = """
{0} requires the torch-scatter library but it was not found in your environment. You can install it with pip as
explained here: https://github.com/rusty1s/pytorch_scatter.
"""
def requires_datasets(obj):
name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
if not is_datasets_available():
......@@ -481,6 +519,12 @@ def requires_protobuf(obj):
raise ImportError(PROTOBUF_IMPORT_ERROR.format(name))
def requires_scatter(obj):
name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
if not is_scatter_available():
raise ImportError(SCATTER_IMPORT_ERROR.format(name))
def add_start_docstrings(*docstr):
def docstring_decorator(fn):
fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "")
......
......@@ -51,6 +51,7 @@ from ..retribert.configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCH
from ..roberta.configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
from ..squeezebert.configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig
from ..t5.configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from ..tapas.configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig
from ..transfo_xl.configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from ..xlm.configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
from ..xlm_prophetnet.configuration_xlm_prophetnet import (
......@@ -95,6 +96,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP,
]
for key, value, in pretrained_map.items()
)
......@@ -141,6 +143,7 @@ CONFIG_MAPPING = OrderedDict(
("dpr", DPRConfig),
("layoutlm", LayoutLMConfig),
("rag", RagConfig),
("tapas", TapasConfig),
]
)
......@@ -185,6 +188,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("prophetnet", "ProphetNet"),
("mt5", "mT5"),
("mpnet", "MPNet"),
("tapas", "TAPAS"),
]
)
......
......@@ -165,6 +165,12 @@ from ..squeezebert.modeling_squeezebert import (
SqueezeBertModel,
)
from ..t5.modeling_t5 import T5ForConditionalGeneration, T5Model
from ..tapas.modeling_tapas import (
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
TapasModel,
)
from ..transfo_xl.modeling_transfo_xl import TransfoXLForSequenceClassification, TransfoXLLMHeadModel, TransfoXLModel
from ..xlm.modeling_xlm import (
XLMForMultipleChoice,
......@@ -230,6 +236,7 @@ from .configuration_auto import (
RobertaConfig,
SqueezeBertConfig,
T5Config,
TapasConfig,
TransfoXLConfig,
XLMConfig,
XLMProphetNetConfig,
......@@ -277,6 +284,7 @@ MODEL_MAPPING = OrderedDict(
(XLMProphetNetConfig, XLMProphetNetModel),
(ProphetNetConfig, ProphetNetModel),
(MPNetConfig, MPNetModel),
(TapasConfig, TapasModel),
]
)
......@@ -308,6 +316,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
(LxmertConfig, LxmertForPreTraining),
(FunnelConfig, FunnelForPreTraining),
(MPNetConfig, MPNetForMaskedLM),
(TapasConfig, TapasForMaskedLM),
]
)
......@@ -340,6 +349,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
(ReformerConfig, ReformerModelWithLMHead),
(FunnelConfig, FunnelForMaskedLM),
(MPNetConfig, MPNetForMaskedLM),
(TapasConfig, TapasForMaskedLM),
]
)
......@@ -386,6 +396,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
(ReformerConfig, ReformerForMaskedLM),
(FunnelConfig, FunnelForMaskedLM),
(MPNetConfig, MPNetForMaskedLM),
(TapasConfig, TapasForMaskedLM),
]
)
......@@ -431,6 +442,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
(CTRLConfig, CTRLForSequenceClassification),
(TransfoXLConfig, TransfoXLForSequenceClassification),
(MPNetConfig, MPNetForSequenceClassification),
(TapasConfig, TapasForSequenceClassification),
]
)
......@@ -455,6 +467,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
(FunnelConfig, FunnelForQuestionAnswering),
(LxmertConfig, LxmertForQuestionAnswering),
(MPNetConfig, MPNetForQuestionAnswering),
(TapasConfig, TapasForQuestionAnswering),
]
)
......
......@@ -47,6 +47,7 @@ from ..rag.tokenization_rag import RagTokenizer
from ..retribert.tokenization_retribert import RetriBertTokenizer
from ..roberta.tokenization_roberta import RobertaTokenizer
from ..squeezebert.tokenization_squeezebert import SqueezeBertTokenizer
from ..tapas.tokenization_tapas import TapasTokenizer
from ..transfo_xl.tokenization_transfo_xl import TransfoXLTokenizer
from ..xlm.tokenization_xlm import XLMTokenizer
from .configuration_auto import (
......@@ -84,6 +85,7 @@ from .configuration_auto import (
RobertaConfig,
SqueezeBertConfig,
T5Config,
TapasConfig,
TransfoXLConfig,
XLMConfig,
XLMProphetNetConfig,
......@@ -223,6 +225,7 @@ TOKENIZER_MAPPING = OrderedDict(
(XLMProphetNetConfig, (XLMProphetNetTokenizer, None)),
(ProphetNetConfig, (ProphetNetTokenizer, None)),
(MPNetConfig, (MPNetTokenizer, MPNetTokenizerFast)),
(TapasConfig, (TapasTokenizer, None)),
]
)
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...file_utils import is_torch_available
from .configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig
from .tokenization_tapas import TapasTokenizer
if is_torch_available():
from .modeling_tapas import (
TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST,
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
TapasModel,
)
# coding=utf-8
# Copyright 2020 Google Research and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
TAPAS configuration. Based on the BERT configuration with added parameters.
Hyperparameters are taken from run_task_main.py and hparam_utils.py of the original implementation. URLS:
- https://github.com/google-research/tapas/blob/master/tapas/run_task_main.py
- https://github.com/google-research/tapas/blob/master/tapas/utils/hparam_utils.py
"""
from ...configuration_utils import PretrainedConfig
TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"google/tapas-base-finetuned-sqa": "https://huggingface.co/google/tapas-base-finetuned-sqa/resolve/main/config.json",
"google/tapas-base-finetuned-wtq": "https://huggingface.co/google/tapas-base-finetuned-wtq/resolve/main/config.json",
"google/tapas-base-finetuned-wikisql-supervised": "https://huggingface.co/google/tapas-base-finetuned-wikisql-supervised/resolve/main/config.json",
"google/tapas-base-finetuned-tabfact": "https://huggingface.co/google/tapas-base-finetuned-tabfact/resolve/main/config.json",
}
class TapasConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.TapasModel`. It is used to
instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the TAPAS `tapas-base-finetuned-sqa`
architecture. Configuration objects inherit from :class:`~transformers.PreTrainedConfig` and can be used to control
the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original
implementation. Original implementation available at https://github.com/google-research/tapas/tree/master.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the TAPAS model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.TapasModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[3, 256, 256, 2, 256, 256, 10]`):
The vocabulary sizes of the :obj:`token_type_ids` passed when calling :class:`~transformers.TapasModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use gradient checkpointing to save memory at the expense of a slower backward pass.
positive_label_weight (:obj:`float`, `optional`, defaults to 10.0):
Weight for positive labels.
num_aggregation_labels (:obj:`int`, `optional`, defaults to 0):
The number of aggregation operators to predict.
aggregation_loss_weight (:obj:`float`, `optional`, defaults to 1.0):
Importance weight for the aggregation loss.
use_answer_as_supervision (:obj:`bool`, `optional`):
Whether to use the answer as the only supervision for aggregation examples.
answer_loss_importance (:obj:`float`, `optional`, defaults to 1.0):
Importance weight for the regression loss.
use_normalized_answer_loss (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to normalize the answer loss by the maximum of the predicted and expected value.
huber_loss_delta: (:obj:`float`, `optional`):
Delta parameter used to calculate the regression loss.
temperature: (:obj:`float`, `optional`, defaults to 1.0):
Value used to control (OR change) the skewness of cell logits probabilities.
aggregation_temperature: (:obj:`float`, `optional`, defaults to 1.0):
Scales aggregation logits to control the skewness of probabilities.
use_gumbel_for_cells: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to apply Gumbel-Softmax to cell selection.
use_gumbel_for_aggregation: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to apply Gumbel-Softmax to aggregation selection.
average_approximation_function: (:obj:`string`, `optional`, defaults to :obj:`"ratio"`):
Method to calculate the expected average of cells in the weak supervision case. One of :obj:`"ratio"`,
:obj:`"first_order"` or :obj:`"second_order"`.
cell_selection_preference: (:obj:`float`, `optional`):
Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for
aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE"
operator) is higher than this hyperparameter, then aggregation is predicted for an example.
answer_loss_cutoff: (:obj:`float`, `optional`):
Ignore examples with answer loss larger than cutoff.
max_num_rows: (:obj:`int`, `optional`, defaults to 64):
Maximum number of rows.
max_num_columns: (:obj:`int`, `optional`, defaults to 32):
Maximum number of columns.
average_logits_per_cell: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to average logits per cell.
select_one_column: (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to constrain the model to only select cells from a single column.
allow_empty_column_selection: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to allow not to select any column.
init_cell_selection_weights_to_zero: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to initialize cell selection weights to 0 so that the initial probabilities are 50%.
reset_position_index_per_cell: (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to restart position indexes at every cell (i.e. use relative position embeddings).
disable_per_token_loss: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to disable any (strong or weak) supervision on cells.
Example::
>>> from transformers import TapasModel, TapasConfig
>>> # Initializing a default (SQA) Tapas configuration
>>> configuration = TapasConfig()
>>> # Initializing a model from the configuration
>>> model = TapasModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "tapas"
def __init__(
self,
vocab_size=30522,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=1024,
type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10],
initializer_range=0.02,
layer_norm_eps=1e-12,
pad_token_id=0,
gradient_checkpointing=False,
positive_label_weight=10.0,
num_aggregation_labels=0,
aggregation_loss_weight=1.0,
use_answer_as_supervision=None,
answer_loss_importance=1.0,
use_normalized_answer_loss=False,
huber_loss_delta=None,
temperature=1.0,
aggregation_temperature=1.0,
use_gumbel_for_cells=False,
use_gumbel_for_aggregation=False,
average_approximation_function="ratio",
cell_selection_preference=None,
answer_loss_cutoff=None,
max_num_rows=64,
max_num_columns=32,
average_logits_per_cell=False,
select_one_column=True,
allow_empty_column_selection=False,
init_cell_selection_weights_to_zero=False,
reset_position_index_per_cell=True,
disable_per_token_loss=False,
**kwargs
):
super().__init__(pad_token_id=pad_token_id, **kwargs)
# BERT hyperparameters (with updated max_position_embeddings and type_vocab_sizes)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_sizes = type_vocab_sizes
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.gradient_checkpointing = gradient_checkpointing
# Fine-tuning task hyperparameters
self.positive_label_weight = positive_label_weight
self.num_aggregation_labels = num_aggregation_labels
self.aggregation_loss_weight = aggregation_loss_weight
self.use_answer_as_supervision = use_answer_as_supervision
self.answer_loss_importance = answer_loss_importance
self.use_normalized_answer_loss = use_normalized_answer_loss
self.huber_loss_delta = huber_loss_delta
self.temperature = temperature
self.aggregation_temperature = aggregation_temperature
self.use_gumbel_for_cells = use_gumbel_for_cells
self.use_gumbel_for_aggregation = use_gumbel_for_aggregation
self.average_approximation_function = average_approximation_function
self.cell_selection_preference = cell_selection_preference
self.answer_loss_cutoff = answer_loss_cutoff
self.max_num_rows = max_num_rows
self.max_num_columns = max_num_columns
self.average_logits_per_cell = average_logits_per_cell
self.select_one_column = select_one_column
self.allow_empty_column_selection = allow_empty_column_selection
self.init_cell_selection_weights_to_zero = init_cell_selection_weights_to_zero
self.reset_position_index_per_cell = reset_position_index_per_cell
self.disable_per_token_loss = disable_per_token_loss
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert TAPAS checkpoint."""
import argparse
from transformers.models.tapas.modeling_tapas import (
TapasConfig,
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
TapasModel,
load_tf_weights_in_tapas,
)
from transformers.models.tapas.tokenization_tapas import TapasTokenizer
from transformers.utils import logging
logging.set_verbosity_info()
def convert_tf_checkpoint_to_pytorch(
task, reset_position_index_per_cell, tf_checkpoint_path, tapas_config_file, pytorch_dump_path
):
# Initialise PyTorch model.
# If you want to convert a checkpoint that uses absolute position embeddings, make sure to set reset_position_index_per_cell of
# TapasConfig to False.
# initialize configuration from json file
config = TapasConfig.from_json_file(tapas_config_file)
# set absolute/relative position embeddings parameter
config.reset_position_index_per_cell = reset_position_index_per_cell
# set remaining parameters of TapasConfig as well as the model based on the task
if task == "SQA":
model = TapasForQuestionAnswering(config=config)
elif task == "WTQ":
# run_task_main.py hparams
config.num_aggregation_labels = 4
config.use_answer_as_supervision = True
# hparam_utils.py hparams
config.answer_loss_cutoff = 0.664694
config.cell_selection_preference = 0.207951
config.huber_loss_delta = 0.121194
config.init_cell_selection_weights_to_zero = True
config.select_one_column = True
config.allow_empty_column_selection = False
config.temperature = 0.0352513
model = TapasForQuestionAnswering(config=config)
elif task == "WIKISQL_SUPERVISED":
# run_task_main.py hparams
config.num_aggregation_labels = 4
config.use_answer_as_supervision = False
# hparam_utils.py hparams
config.answer_loss_cutoff = 36.4519
config.cell_selection_preference = 0.903421
config.huber_loss_delta = 222.088
config.init_cell_selection_weights_to_zero = True
config.select_one_column = True
config.allow_empty_column_selection = True
config.temperature = 0.763141
model = TapasForQuestionAnswering(config=config)
elif task == "TABFACT":
model = TapasForSequenceClassification(config=config)
elif task == "MLM":
model = TapasForMaskedLM(config=config)
elif task == "INTERMEDIATE_PRETRAINING":
model = TapasModel(config=config)
print("Building PyTorch model from configuration: {}".format(str(config)))
# Load weights from tf checkpoint
load_tf_weights_in_tapas(model, config, tf_checkpoint_path)
# Save pytorch-model (weights and configuration)
print("Save PyTorch model to {}".format(pytorch_dump_path))
model.save_pretrained(pytorch_dump_path[:-17])
# Save tokenizer files
dir_name = r"C:\Users\niels.rogge\Documents\Python projecten\tensorflow\Tensorflow models\SQA\Base\tapas_sqa_inter_masklm_base_reset"
tokenizer = TapasTokenizer(vocab_file=dir_name + r"\vocab.txt", model_max_length=512)
print("Save tokenizer files to {}".format(pytorch_dump_path))
tokenizer.save_pretrained(pytorch_dump_path[:-17])
print("Used relative position embeddings:", model.config.reset_position_index_per_cell)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--task", default="SQA", type=str, help="Model task for which to convert a checkpoint. Defaults to SQA."
)
parser.add_argument(
"--reset_position_index_per_cell",
default=False,
action="store_true",
help="Whether to use relative position embeddings or not. Defaults to True.",
)
parser.add_argument(
"--tf_checkpoint_path", default=None, type=str, required=True, help="Path to the TensorFlow checkpoint path."
)
parser.add_argument(
"--tapas_config_file",
default=None,
type=str,
required=True,
help="The config json file corresponding to the pre-trained TAPAS model. \n"
"This specifies the model architecture.",
)
parser.add_argument(
"--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
)
args = parser.parse_args()
convert_tf_checkpoint_to_pytorch(
args.task,
args.reset_position_index_per_cell,
args.tf_checkpoint_path,
args.tapas_config_file,
args.pytorch_dump_path,
)
# coding=utf-8
# Copyright 2020 Google Research and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch TAPAS model. """
import enum
import math
import os
from dataclasses import dataclass
from typing import Optional, Tuple
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from ...activations import ACT2FN
from ...file_utils import (
ModelOutput,
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_scatter_available,
replace_return_docstrings,
requires_scatter,
)
from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, MaskedLMOutput, SequenceClassifierOutput
from ...modeling_utils import (
PreTrainedModel,
apply_chunking_to_forward,
find_pruneable_heads_and_indices,
prune_linear_layer,
)
from ...utils import logging
from .configuration_tapas import TapasConfig
# soft dependency
if is_scatter_available():
from torch_scatter import scatter
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "TapasConfig"
_TOKENIZER_FOR_DOC = "TapasTokenizer"
TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST = [
# large models
"google/tapas-large",
"google/tapas-large-finetuned-sqa",
"google/tapas-large-finetuned-wtq",
"google/tapas-large-finetuned-wikisql-supervised",
"google/tapas-large-finetuned-tabfact",
# base models
"google/tapas-base",
"google/tapas-base-finetuned-sqa",
"google/tapas-base-finetuned-wtq",
"google/tapas-base-finetuned-wikisql-supervised",
"google/tapas-base-finetuned-tabfact",
# small models
"google/tapas-small",
"google/tapas-small-finetuned-sqa",
"google/tapas-small-finetuned-wtq",
"google/tapas-small-finetuned-wikisql-supervised",
"google/tapas-small-finetuned-tabfact",
# mini models
"google/tapas-mini",
"google/tapas-mini-finetuned-sqa",
"google/tapas-mini-finetuned-wtq",
"google/tapas-mini-finetuned-wikisql-supervised",
"google/tapas-mini-finetuned-tabfact",
# tiny models
"google/tapas-tiny",
"google/tapas-tiny-finetuned-sqa",
"google/tapas-tiny-finetuned-wtq",
"google/tapas-tiny-finetuned-wikisql-supervised",
"google/tapas-tiny-finetuned-tabfact",
# See all TAPAS models at https://huggingface.co/models?filter=tapas
]
EPSILON_ZERO_DIVISION = 1e-10
CLOSE_ENOUGH_TO_LOG_ZERO = -10000.0
@dataclass
class TableQuestionAnsweringOutput(ModelOutput):
"""
Output type of :class:`~transformers.TapasForQuestionAnswering`.
Args:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` (and possibly :obj:`answer`, :obj:`aggregation_labels`, :obj:`numeric_values` and :obj:`numeric_values_scale` are provided)):
Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the
semi-supervised regression loss and (optionally) supervised loss for aggregations.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
Prediction scores of the cell selection head, for every token.
logits_aggregation (:obj:`torch.FloatTensor`, `optional`, of shape :obj:`(batch_size, num_aggregation_labels)`):
Prediction scores of the aggregation head, for every aggregation operator.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of
each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
"""
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
logits_aggregation: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
def load_tf_weights_in_tapas(model, config, tf_checkpoint_path):
"""
Load tf checkpoints in a PyTorch model. This is an adaptation from load_tf_weights_in_bert
- add cell selection and aggregation heads
- take into account additional token type embedding layers
"""
try:
import re
import numpy as np
import tensorflow as tf
except ImportError:
logger.error(
"Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
"https://www.tensorflow.org/install/ for installation instructions."
)
raise
tf_path = os.path.abspath(tf_checkpoint_path)
logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))
# Load weights from TF model
init_vars = tf.train.list_variables(tf_path)
names = []
arrays = []
for name, shape in init_vars:
logger.info("Loading TF weight {} with shape {}".format(name, shape))
array = tf.train.load_variable(tf_path, name)
names.append(name)
arrays.append(array)
for name, array in zip(names, arrays):
name = name.split("/")
# adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculate m and v
# which are not required for using pretrained model
if any(
n
in [
"adam_v",
"adam_m",
"AdamWeightDecayOptimizer",
"AdamWeightDecayOptimizer_1",
"global_step",
"seq_relationship",
]
for n in name
):
logger.info("Skipping {}".format("/".join(name)))
continue
# in case the model is TapasForSequenceClassification, we skip output_bias and output_weights
# since these are not used for classification
if isinstance(model, TapasForSequenceClassification):
if any(n in ["output_bias", "output_weights"] for n in name):
logger.info("Skipping {}".format("/".join(name)))
continue
# in case the model is TapasModel, we skip output_bias, output_weights, output_bias_cls and output_weights_cls
# since this model does not have MLM and NSP heads
if isinstance(model, TapasModel):
if any(n in ["output_bias", "output_weights", "output_bias_cls", "output_weights_cls"] for n in name):
logger.info("Skipping {}".format("/".join(name)))
continue
# if first scope name starts with "bert", change it to "tapas"
if name[0] == "bert":
name[0] = "tapas"
pointer = model
for m_name in name:
if re.fullmatch(r"[A-Za-z]+_\d+", m_name):
scope_names = re.split(r"_(\d+)", m_name)
else:
scope_names = [m_name]
if scope_names[0] == "kernel" or scope_names[0] == "gamma":
pointer = getattr(pointer, "weight")
elif scope_names[0] == "beta":
pointer = getattr(pointer, "bias")
# cell selection heads
elif scope_names[0] == "output_bias":
pointer = getattr(pointer, "output_bias")
elif scope_names[0] == "output_weights":
pointer = getattr(pointer, "output_weights")
elif scope_names[0] == "column_output_bias":
pointer = getattr(pointer, "column_output_bias")
elif scope_names[0] == "column_output_weights":
pointer = getattr(pointer, "column_output_weights")
# aggregation head
elif scope_names[0] == "output_bias_agg":
pointer = getattr(pointer, "aggregation_classifier")
pointer = getattr(pointer, "bias")
elif scope_names[0] == "output_weights_agg":
pointer = getattr(pointer, "aggregation_classifier")
pointer = getattr(pointer, "weight")
# classification head
elif scope_names[0] == "output_bias_cls":
pointer = getattr(pointer, "classifier")
pointer = getattr(pointer, "bias")
elif scope_names[0] == "output_weights_cls":
pointer = getattr(pointer, "classifier")
pointer = getattr(pointer, "weight")
else:
try:
pointer = getattr(pointer, scope_names[0])
except AttributeError:
logger.info("Skipping {}".format("/".join(name)))
continue
if len(scope_names) >= 2:
num = int(scope_names[1])
pointer = pointer[num]
if m_name[-11:] == "_embeddings":
pointer = getattr(pointer, "weight")
elif m_name[-13:] in [f"_embeddings_{i}" for i in range(7)]:
pointer = getattr(pointer, "weight")
elif m_name == "kernel":
array = np.transpose(array)
try:
assert (
pointer.shape == array.shape
), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
except AssertionError as e:
e.args += (pointer.shape, array.shape)
raise
logger.info("Initialize PyTorch weight {}".format(name))
# Added a check to see whether the array is a scalar (because bias terms in Tapas checkpoints can be
# scalar => should first be converted to numpy arrays)
if np.isscalar(array):
array = np.array(array)
pointer.data = torch.from_numpy(array)
return model
class TapasEmbeddings(nn.Module):
"""
Construct the embeddings from word, position and token_type embeddings. Same as BertEmbeddings but with a number of
additional token type embeddings to encode tabular structure.
"""
def __init__(self, config):
super().__init__()
# we do not include config.disabled_features and config.disable_position_embeddings from the original implementation
# word embeddings
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
# position embeddings
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
# token type embeddings
for i, type_vocab_sizes in enumerate(config.type_vocab_sizes):
name = f"token_type_embeddings_{i}"
setattr(self, name, nn.Embedding(type_vocab_sizes, config.hidden_size))
self.number_of_token_type_embeddings = len(config.type_vocab_sizes)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.config = config
def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
seq_length = input_shape[1]
device = input_ids.device if input_ids is not None else inputs_embeds.device
if position_ids is None:
# create absolute position embeddings
position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
position_ids = position_ids.unsqueeze(0).expand(input_shape)
# when self.config.reset_position_index_per_cell is set to True, create relative position embeddings
if self.config.reset_position_index_per_cell:
# shape (batch_size, seq_len)
col_index = IndexMap(token_type_ids[:, :, 1], self.config.type_vocab_sizes[1], batch_dims=1)
# shape (batch_size, seq_len)
row_index = IndexMap(token_type_ids[:, :, 2], self.config.type_vocab_sizes[2], batch_dims=1)
# shape (batch_size, seq_len)
full_index = ProductIndexMap(col_index, row_index)
# shape (max_rows * max_columns,). First absolute position for every cell
first_position_per_segment = reduce_min(position_ids, full_index)[0]
# ? shape (batch_size, seq_len). First absolute position of the cell for every token
first_position = gather(first_position_per_segment, full_index)
# shape (1, seq_len)
position = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0)
position_ids = torch.min(
torch.as_tensor(self.config.max_position_embeddings - 1, device=device), position - first_position
)
if token_type_ids is None:
token_type_ids = torch.zeros(
(input_shape + self.number_of_token_type_embeddings), dtype=torch.long, device=device
)
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
embeddings = inputs_embeds + position_embeddings
for i in range(self.number_of_token_type_embeddings):
name = f"token_type_embeddings_{i}"
embeddings += getattr(self, name)(token_type_ids[:, :, i])
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
class TapasSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError(
f"The hidden size {config.hidden_size} is not a multiple of the number of attention "
f"heads {config.num_attention_heads}"
)
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
output_attentions=False,
):
mixed_query_layer = self.query(hidden_states)
# If this is instantiated as a cross-attention module, the keys
# and values come from an encoder; the attention mask needs to be
# such that the encoder's padding tokens are not attended to.
if encoder_hidden_states is not None:
mixed_key_layer = self.key(encoder_hidden_states)
mixed_value_layer = self.value(encoder_hidden_states)
attention_mask = encoder_attention_mask
else:
mixed_key_layer = self.key(hidden_states)
mixed_value_layer = self.value(hidden_states)
query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
# Apply the attention mask is (precomputed for all layers in TapasModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs = attention_probs * head_mask
context_layer = torch.matmul(attention_probs, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
return outputs
# Copied from transformers.models.bert.modeling_bert.BertSelfOutput
class TapasSelfOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Tapas
class TapasAttention(nn.Module):
def __init__(self, config):
super().__init__()
self.self = TapasSelfAttention(config)
self.output = TapasSelfOutput(config)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
output_attentions=False,
):
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
output_attentions,
)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
# Copied from transformers.models.bert.modeling_bert.BertIntermediate
class TapasIntermediate(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertOutput
class TapasOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
# Copied from transformers.models.bert.modeling_bert.BertLayer with Bert->Tapas
class TapasLayer(nn.Module):
def __init__(self, config):
super().__init__()
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = TapasAttention(config)
self.is_decoder = config.is_decoder
self.add_cross_attention = config.add_cross_attention
if self.add_cross_attention:
assert self.is_decoder, f"{self} should be used as a decoder model if cross attention is added"
self.crossattention = TapasAttention(config)
self.intermediate = TapasIntermediate(config)
self.output = TapasOutput(config)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
output_attentions=False,
):
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
)
attention_output = self_attention_outputs[0]
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
if self.is_decoder and encoder_hidden_states is not None:
assert hasattr(
self, "crossattention"
), f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`"
cross_attention_outputs = self.crossattention(
attention_output,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
output_attentions,
)
attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[1:] # add cross attentions if we output attention weights
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
)
outputs = (layer_output,) + outputs
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
class TapasEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.layer = nn.ModuleList([TapasLayer(config) for _ in range(config.num_hidden_layers)])
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
):
all_hidden_states = () if output_hidden_states else None
all_attentions = () if output_attentions else None
for i, layer_module in enumerate(self.layer):
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_head_mask = head_mask[i] if head_mask is not None else None
if getattr(self.config, "gradient_checkpointing", False):
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module),
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
)
else:
layer_outputs = layer_module(
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
output_attentions,
)
hidden_states = layer_outputs[0]
if output_attentions:
all_attentions = all_attentions + (layer_outputs[1],)
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)
return BaseModelOutput(
last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
)
# Copied from transformers.models.bert.modeling_bert.BertPooler
class TapasPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
class TapasPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = TapasConfig
base_model_prefix = "tapas"
# Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
def _init_weights(self, module):
""" Initialize the weights """
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()
TAPAS_START_DOCSTRING = r"""
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
general usage and behavior.
Parameters:
config (:class:`~transformers.TapasConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
weights.
"""
TAPAS_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices can be obtained using
:class:`~transformers.TapasTokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0}, 7)`, `optional`):
Token indices that encode tabular structure. Indices can be obtained using
:class:`~transformers.TapasTokenizer`. See this class for more info.
`What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. If
``reset_position_index_per_cell`` of :class:`~transformers.TapasConfig` is set to ``True``, relative
position embeddings will be used. Selected in the range ``[0, config.max_position_embeddings - 1]``.
`What are position IDs? <../glossary.html#position-ids>`_
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``: - 1
indicates the head is **not masked**, - 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`):
Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`):
Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""
@add_start_docstrings(
"The bare Tapas Model transformer outputting raw hidden-states without any specific head on top.",
TAPAS_START_DOCSTRING,
)
class TapasModel(TapasPreTrainedModel):
"""
This class is a small change compared to :class:`~transformers.BertModel`, taking into account the additional token
type ids.
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
cross-attention is added between the self-attention layers, following the architecture described in `Attention is
all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
"""
def __init__(self, config, add_pooling_layer=True):
requires_scatter(self)
super().__init__(config)
self.config = config
self.embeddings = TapasEmbeddings(config)
self.encoder = TapasEncoder(config)
self.pooler = TapasPooler(config) if add_pooling_layer else None
self.init_weights()
def get_input_embeddings(self):
return self.embeddings.word_embeddings
def set_input_embeddings(self, value):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
Returns:
Examples::
>>> from transformers import TapasTokenizer, TapasModel
>>> import pandas as pd
>>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base')
>>> model = TapasModel.from_pretrained('google/tapas-base')
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
... 'Age': ["56", "45", "59"],
... 'Number of movies': ["87", "53", "69"]
... }
>>> table = pd.DataFrame.from_dict(data)
>>> queries = ["How many movies has George Clooney played in?", "How old is Brad Pitt?"]
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
input_shape = input_ids.size()
elif inputs_embeds is not None:
input_shape = inputs_embeds.size()[:-1]
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
device = input_ids.device if input_ids is not None else inputs_embeds.device
if attention_mask is None:
attention_mask = torch.ones(input_shape, device=device)
if token_type_ids is None:
token_type_ids = torch.zeros(
(*input_shape, len(self.config.type_vocab_sizes)), dtype=torch.long, device=device
)
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
# If a 2D ou 3D attention mask is provided for the cross-attention
# we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]
if self.config.is_decoder and encoder_hidden_states is not None:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
if encoder_attention_mask is None:
encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
else:
encoder_extended_attention_mask = None
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
embedding_output = self.embeddings(
input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
)
encoder_outputs = self.encoder(
embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_extended_attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPooling(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
@add_start_docstrings("""Tapas Model with a `language modeling` head on top. """, TAPAS_START_DOCSTRING)
class TapasForMaskedLM(TapasPreTrainedModel):
config_class = TapasConfig
base_model_prefix = "tapas"
def __init__(self, config):
super().__init__(config)
self.tapas = TapasModel(config, add_pooling_layer=False)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
self.init_weights()
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, word_embeddings):
self.lm_head = word_embeddings
@add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=MaskedLMOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
Returns:
Examples::
>>> from transformers import TapasTokenizer, TapasForMaskedLM
>>> import pandas as pd
>>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base')
>>> model = TapasForMaskedLM.from_pretrained('google/tapas-base')
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
... 'Age': ["56", "45", "59"],
... 'Number of movies': ["87", "53", "69"]
... }
>>> table = pd.DataFrame.from_dict(data)
>>> inputs = tokenizer(table=table, queries="How many [MASK] has George [MASK] played in?", return_tensors="pt")
>>> labels = tokenizer(table=table, queries="How many movies has George Clooney played in?", return_tensors="pt")["input_ids"]
>>> outputs = model(**inputs, labels=labels)
>>> last_hidden_states = outputs.last_hidden_state
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.tapas(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
prediction_scores = self.lm_head(sequence_output)
masked_lm_loss = None
if labels is not None:
loss_fct = CrossEntropyLoss() # -100 index = padding token
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
if not return_dict:
output = (prediction_scores,) + outputs[2:]
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
return MaskedLMOutput(
loss=masked_lm_loss,
logits=prediction_scores,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""
Tapas Model with a cell selection head and optional aggregation head on top for question-answering tasks on tables
(linear layers on top of the hidden-states output to compute `logits` and optional `logits_aggregation`), e.g. for
SQA, WTQ or WikiSQL-supervised tasks.
""",
TAPAS_START_DOCSTRING,
)
class TapasForQuestionAnswering(TapasPreTrainedModel):
def __init__(self, config: TapasConfig):
super().__init__(config)
# base model
self.tapas = TapasModel(config)
# dropout (only used when training)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# cell selection heads
if config.init_cell_selection_weights_to_zero:
# init_cell_selection_weights_to_zero: Whether the initial weights should be
# set to 0. This ensures that all tokens have the same prior probability.
self.output_weights = nn.Parameter(torch.zeros(config.hidden_size))
self.column_output_weights = nn.Parameter(torch.zeros(config.hidden_size))
else:
self.output_weights = nn.Parameter(torch.empty(config.hidden_size))
nn.init.normal_(
self.output_weights, std=config.initializer_range
) # here, a truncated normal is used in the original implementation
self.column_output_weights = nn.Parameter(torch.empty(config.hidden_size))
nn.init.normal_(
self.column_output_weights, std=config.initializer_range
) # here, a truncated normal is used in the original implementation
self.output_bias = nn.Parameter(torch.zeros([]))
self.column_output_bias = nn.Parameter(torch.zeros([]))
# aggregation head
if config.num_aggregation_labels > 0:
self.aggregation_classifier = nn.Linear(config.hidden_size, config.num_aggregation_labels)
self.init_weights()
@add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=TableQuestionAnsweringOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
table_mask=None,
labels=None,
aggregation_labels=None,
float_answer=None,
numeric_values=None,
numeric_values_scale=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
table_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`):
Mask for the table. Indicates which tokens belong to the table (1). Question tokens, table headers and
padding are 0.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`):
Labels per token for computing the hierarchical cell selection loss. This encodes the positions of the
answer appearing in the table. Can be obtained using :class:`~transformers.TapasTokenizer`.
- 1 for tokens that are **part of the answer**,
- 0 for tokens that are **not part of the answer**.
aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`, `optional`):
Aggregation function index for every example in the batch for computing the aggregation loss. Indices
should be in :obj:`[0, ..., config.num_aggregation_labels - 1]`. Only required in case of strong
supervision for aggregation (WikiSQL-supervised).
float_answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`, `optional`):
Float answer for every example in the batch. Set to `float('nan')` for cell selection questions. Only
required in case of weak supervision (WTQ) to calculate the aggregate mask and regression loss.
numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`):
Numeric values of every token, NaN for tokens which are not numeric values. Can be obtained using
:class:`~transformers.TapasTokenizer`. Only required in case of weak supervision for aggregation (WTQ) to
calculate the regression loss.
numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`):
Scale of the numeric values of every token. Can be obtained using :class:`~transformers.TapasTokenizer`.
Only required in case of weak supervision for aggregation (WTQ) to calculate the regression loss.
Returns:
Examples::
>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd
>>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-finetuned-wtq')
>>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-finetuned-wtq')
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
... 'Age': ["56", "45", "59"],
... 'Number of movies': ["87", "53", "69"]
... }
>>> table = pd.DataFrame.from_dict(data)
>>> queries = ["How many movies has George Clooney played in?", "How old is Brad Pitt?"]
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> logits_aggregation = outputs.logits_aggregation
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.tapas(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
pooled_output = outputs[1]
sequence_output = self.dropout(sequence_output)
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
device = input_ids.device if input_ids is not None else inputs_embeds.device
# Construct indices for the table.
if token_type_ids is None:
token_type_ids = torch.zeros(
(*input_shape, len(self.config.type_vocab_sizes)), dtype=torch.long, device=device
)
token_types = [
"segment_ids",
"column_ids",
"row_ids",
"prev_labels",
"column_ranks",
"inv_column_ranks",
"numeric_relations",
]
row_ids = token_type_ids[:, :, token_types.index("row_ids")]
column_ids = token_type_ids[:, :, token_types.index("column_ids")]
row_index = IndexMap(
indices=torch.min(row_ids, torch.as_tensor(self.config.max_num_rows - 1, device=row_ids.device)),
num_segments=self.config.max_num_rows,
batch_dims=1,
)
col_index = IndexMap(
indices=torch.min(column_ids, torch.as_tensor(self.config.max_num_columns - 1, device=column_ids.device)),
num_segments=self.config.max_num_columns,
batch_dims=1,
)
cell_index = ProductIndexMap(row_index, col_index)
# Masks.
input_shape = input_ids.size() if input_ids is not None else inputs_embeds.size()[:-1]
device = input_ids.device if input_ids is not None else inputs_embeds.device
if attention_mask is None:
attention_mask = torch.ones(input_shape, device=device)
# Table cells only, without question tokens and table headers.
if table_mask is None:
table_mask = torch.where(row_ids > 0, torch.ones_like(row_ids), torch.zeros_like(row_ids))
# torch.FloatTensor[batch_size, seq_length]
input_mask_float = attention_mask.float().to(device)
table_mask_float = table_mask.float().to(device)
# Mask for cells that exist in the table (i.e. that are not padding).
cell_mask, _ = reduce_mean(input_mask_float, cell_index)
# Compute logits per token. These are used to select individual cells.
logits = compute_token_logits(sequence_output, self.config.temperature, self.output_weights, self.output_bias)
# Compute logits per column. These are used to select a column.
column_logits = None
if self.config.select_one_column:
column_logits = compute_column_logits(
sequence_output,
self.column_output_weights,
self.column_output_bias,
cell_index,
cell_mask,
self.config.allow_empty_column_selection,
)
# Aggregation logits
logits_aggregation = None
if self.config.num_aggregation_labels > 0:
logits_aggregation = self.aggregation_classifier(pooled_output)
# Total loss calculation
total_loss = 0.0
calculate_loss = False
if labels is not None:
calculate_loss = True
is_supervised = not self.config.num_aggregation_labels > 0 or not self.config.use_answer_as_supervision
# Semi-supervised cell selection in case of no aggregation:
# If the answer (the denotation) appears directly in the table we might
# select the answer without applying any aggregation function. There are
# some ambiguous cases, see utils._calculate_aggregate_mask for more info.
# `aggregate_mask` is 1 for examples where we chose to aggregate and 0
# for examples where we chose to select the answer directly.
# `labels` encodes the positions of the answer appearing in the table.
if is_supervised:
aggregate_mask = None
else:
if float_answer is not None:
assert (
labels.shape[0] == float_answer.shape[0]
), "Make sure the answers are a FloatTensor of shape (batch_size,)"
# <float32>[batch_size]
aggregate_mask = _calculate_aggregate_mask(
float_answer,
pooled_output,
self.config.cell_selection_preference,
labels,
self.aggregation_classifier,
)
else:
raise ValueError("You have to specify float answers in order to calculate the aggregate mask")
# Cell selection log-likelihood
if self.config.average_logits_per_cell:
logits_per_cell, _ = reduce_mean(logits, cell_index)
logits = gather(logits_per_cell, cell_index)
dist_per_token = torch.distributions.Bernoulli(logits=logits)
# Compute cell selection loss per example.
selection_loss_per_example = None
if not self.config.select_one_column:
weight = torch.where(
labels == 0,
torch.ones_like(labels, dtype=torch.float32),
self.config.positive_label_weight * torch.ones_like(labels, dtype=torch.float32),
)
selection_loss_per_token = -dist_per_token.log_prob(labels) * weight
selection_loss_per_example = torch.sum(selection_loss_per_token * input_mask_float, dim=1) / (
torch.sum(input_mask_float, dim=1) + EPSILON_ZERO_DIVISION
)
else:
selection_loss_per_example, logits = _single_column_cell_selection_loss(
logits, column_logits, labels, cell_index, col_index, cell_mask
)
dist_per_token = torch.distributions.Bernoulli(logits=logits)
# Supervised cell selection
if self.config.disable_per_token_loss:
pass
elif is_supervised:
total_loss += torch.mean(selection_loss_per_example)
else:
# For the not supervised case, do not assign loss for cell selection
total_loss += torch.mean(selection_loss_per_example * (1.0 - aggregate_mask))
# Semi-supervised regression loss and supervised loss for aggregations
if self.config.num_aggregation_labels > 0:
if is_supervised:
# Note that `aggregate_mask` is None if the setting is supervised.
if aggregation_labels is not None:
assert (
labels.shape[0] == aggregation_labels.shape[0]
), "Make sure the aggregation labels are a LongTensor of shape (batch_size,)"
per_example_additional_loss = _calculate_aggregation_loss(
logits_aggregation,
aggregate_mask,
aggregation_labels,
self.config.use_answer_as_supervision,
self.config.num_aggregation_labels,
self.config.aggregation_loss_weight,
)
else:
raise ValueError(
"You have to specify aggregation labels in order to calculate the aggregation loss"
)
else:
# Set aggregation labels to zeros
aggregation_labels = torch.zeros(labels.shape[0], dtype=torch.long, device=labels.device)
per_example_additional_loss = _calculate_aggregation_loss(
logits_aggregation,
aggregate_mask,
aggregation_labels,
self.config.use_answer_as_supervision,
self.config.num_aggregation_labels,
self.config.aggregation_loss_weight,
)
if self.config.use_answer_as_supervision:
if numeric_values is not None and numeric_values_scale is not None:
assert numeric_values.shape == numeric_values_scale.shape
# Add regression loss for numeric answers which require aggregation.
answer_loss, large_answer_loss_mask = _calculate_regression_loss(
float_answer,
aggregate_mask,
dist_per_token,
numeric_values,
numeric_values_scale,
table_mask_float,
logits_aggregation,
self.config,
)
per_example_additional_loss += answer_loss
# Zero loss for examples with answer_loss > cutoff.
per_example_additional_loss *= large_answer_loss_mask
else:
raise ValueError(
"You have to specify numeric values and numeric values scale in order to calculate the regression loss"
)
total_loss += torch.mean(per_example_additional_loss)
else:
# if no label ids are provided, set them to zeros in order to properly compute logits
labels = torch.zeros_like(logits)
_, logits = _single_column_cell_selection_loss(
logits, column_logits, labels, cell_index, col_index, cell_mask
)
if not return_dict:
output = (logits, logits_aggregation) + outputs[2:]
return ((total_loss,) + output) if calculate_loss else output
return TableQuestionAnsweringOutput(
loss=total_loss if calculate_loss else None,
logits=logits,
logits_aggregation=logits_aggregation,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@add_start_docstrings(
"""
Tapas Model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for table
entailment tasks, such as TabFact (Chen et al., 2020).
""",
TAPAS_START_DOCSTRING,
)
class TapasForSequenceClassification(TapasPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.tapas = TapasModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
@add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids=None,
attention_mask=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
inputs_embeds=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). Note: this is called
"classification_class_index" in the original implementation.
Returns:
Examples::
>>> from transformers import TapasTokenizer, TapasForSequenceClassification
>>> import torch
>>> import pandas as pd
>>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-finetuned-tabfact')
>>> model = TapasForSequenceClassification.from_pretrained('google/tapas-base-finetuned-tabfact')
>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
... 'Age': ["56", "45", "59"],
... 'Number of movies': ["87", "53", "69"]
... }
>>> table = pd.DataFrame.from_dict(data)
>>> queries = ["There is only one actor who is 45 years old", "There are 3 actors which played in more than 60 movies"]
>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
>>> labels = torch.tensor([1, 0]) # 1 means entailed, 0 means refuted
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.tapas(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
loss = None
if labels is not None:
if self.num_labels == 1:
# We are doing regression
loss_fct = MSELoss()
loss = loss_fct(logits.view(-1), labels.view(-1))
else:
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
if not return_dict:
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
""" TAPAS utilities."""
class AverageApproximationFunction(str, enum.Enum):
RATIO = "ratio"
FIRST_ORDER = "first_order"
SECOND_ORDER = "second_order"
# Beginning of everything related to segmented tensors
class IndexMap(object):
"""Index grouping entries within a tensor."""
def __init__(self, indices, num_segments, batch_dims=0):
"""
Creates an index
Args:
indices (:obj:`torch.LongTensor`, same shape as a `values` Tensor to which the indices refer):
Tensor containing the indices.
num_segments (:obj:`torch.LongTensor`):
Scalar tensor, the number of segments. All elements in a batched segmented tensor must have the same
number of segments (although many segments can be empty).
batch_dims (:obj:`int`, `optional`, defaults to 0):
The number of batch dimensions. The first `batch_dims` dimensions of a SegmentedTensor are treated as
batch dimensions. Segments in different batch elements are always distinct even if they have the same
index.
"""
self.indices = torch.as_tensor(indices)
self.num_segments = torch.as_tensor(num_segments, device=indices.device)
self.batch_dims = batch_dims
def batch_shape(self):
return self.indices.size()[: self.batch_dims] # returns a torch.Size object
class ProductIndexMap(IndexMap):
"""The product of two indices."""
def __init__(self, outer_index, inner_index):
"""
Combines indices i and j into pairs (i, j). The result is an index where each segment (i, j) is the
intersection of segments i and j. For example if the inputs represent table cells indexed by respectively rows
and columns the output will be a table indexed by (row, column) pairs, i.e. by cell. The implementation
combines indices {0, .., n - 1} and {0, .., m - 1} into {0, .., nm - 1}. The output has `num_segments` equal to
`outer_index.num_segments` * `inner_index.num_segments`
Args:
outer_index (:obj:`IndexMap`):
IndexMap.
inner_index (:obj:`IndexMap`):
IndexMap, must have the same shape as `outer_index`.
"""
if outer_index.batch_dims != inner_index.batch_dims:
raise ValueError("outer_index.batch_dims and inner_index.batch_dims must be the same.")
super().__init__(
indices=(inner_index.indices + outer_index.indices * inner_index.num_segments),
num_segments=inner_index.num_segments * outer_index.num_segments,
batch_dims=inner_index.batch_dims,
)
self.outer_index = outer_index
self.inner_index = inner_index
def project_outer(self, index):
"""Projects an index with the same index set onto the outer components."""
return IndexMap(
indices=(index.indices // self.inner_index.num_segments).type(torch.float).floor().type(torch.long),
num_segments=self.outer_index.num_segments,
batch_dims=index.batch_dims,
)
def project_inner(self, index):
"""Projects an index with the same index set onto the inner components."""
return IndexMap(
indices=torch.fmod(index.indices, self.inner_index.num_segments)
.type(torch.float)
.floor()
.type(torch.long),
num_segments=self.inner_index.num_segments,
batch_dims=index.batch_dims,
)
def gather(values, index, name="segmented_gather"):
"""
Gathers from `values` using the index map. For each element in the domain of the index map this operation looks up
a value for that index in `values`. Two elements from the same segment always get assigned the same value.
Args:
values (:obj:`torch.Tensor` of shape (B1, ..., Bn, num_segments, V1, ...)):
Tensor with segment values.
index (:obj:`IndexMap` of shape (B1, ..., Bn, I1, ..., Ik)):
IndexMap.
name (:obj:`str`, `optional`, defaults to 'segmented_gather'):
Name for the operation. Currently not used
Returns:
:obj:`tuple(torch.Tensor)`: Tensor of shape (B1, ..., Bn, I1, ..., Ik, V1, ...) with the gathered values.
"""
indices = index.indices
# first, check whether the indices of the index represent scalar values (i.e. not vectorized)
if len(values.shape[index.batch_dims :]) < 2:
return torch.gather(
values,
index.batch_dims,
indices.view(
values.size()[0], -1
), # torch.gather expects index to have the same number of dimensions as values
).view(indices.size())
else:
# this means we have a vectorized version
# we have to adjust the index
indices = indices.unsqueeze(-1).expand(values.shape)
return torch.gather(values, index.batch_dims, indices)
def flatten(index, name="segmented_flatten"):
"""
Flattens a batched index map (which is typically of shape batch_size, seq_length) to a 1d index map. This operation
relabels the segments to keep batch elements distinct. The k-th batch element will have indices shifted by
`num_segments` * (k - 1). The result is a tensor with `num_segments` multiplied by the number of elements in the
batch.
Args:
index (:obj:`IndexMap`):
IndexMap to flatten.
name (:obj:`str`, `optional`, defaults to 'segmented_flatten'):
Name for the operation. Currently not used
Returns:
(:obj:`IndexMap`): The flattened IndexMap.
"""
# first, get batch_size as scalar tensor
batch_size = torch.prod(torch.tensor(list(index.batch_shape())))
# next, create offset as 1-D tensor of length batch_size,
# and multiply element-wise by num segments (to offset different elements in the batch) e.g. if batch size is 2: [0, 64]
offset = torch.arange(start=0, end=batch_size, device=index.num_segments.device) * index.num_segments
offset = offset.view(index.batch_shape())
for _ in range(index.batch_dims, len(index.indices.size())): # typically range(1,2)
offset = offset.unsqueeze(-1)
indices = offset + index.indices
return IndexMap(indices=indices.view(-1), num_segments=index.num_segments * batch_size, batch_dims=0)
def range_index_map(batch_shape, num_segments, name="range_index_map"):
"""
Constructs an index map equal to range(num_segments).
Args:
batch_shape (:obj:`torch.Size`):
Batch shape
num_segments (:obj:`int`):
Number of segments
name (:obj:`str`, `optional`, defaults to 'range_index_map'):
Name for the operation. Currently not used
Returns:
(:obj:`IndexMap`): IndexMap of shape batch_shape with elements equal to range(num_segments).
"""
batch_shape = torch.as_tensor(
batch_shape, dtype=torch.long
) # create a rank 1 tensor vector containing batch_shape (e.g. [2])
assert len(batch_shape.size()) == 1
num_segments = torch.as_tensor(num_segments) # create a rank 0 tensor (scalar) containing num_segments (e.g. 64)
assert len(num_segments.size()) == 0
indices = torch.arange(
start=0, end=num_segments, device=num_segments.device
) # create a rank 1 vector with num_segments elements
new_tensor = torch.cat(
[torch.ones_like(batch_shape, dtype=torch.long, device=num_segments.device), num_segments.unsqueeze(dim=0)],
dim=0,
)
# new_tensor is just a vector of [1 64] for example (assuming only 1 batch dimension)
new_shape = [int(x) for x in new_tensor.tolist()]
indices = indices.view(new_shape)
multiples = torch.cat([batch_shape, torch.as_tensor([1])], dim=0)
indices = indices.repeat(multiples.tolist())
# equivalent (in Numpy:)
# indices = torch.as_tensor(np.tile(indices.numpy(), multiples.tolist()))
return IndexMap(indices=indices, num_segments=num_segments, batch_dims=list(batch_shape.size())[0])
def _segment_reduce(values, index, segment_reduce_fn, name):
"""
Applies a segment reduction segment-wise.
Args:
values (:obj:`torch.Tensor`):
Tensor with segment values.
index (:obj:`IndexMap`):
IndexMap.
segment_reduce_fn (:obj:`str`):
Name for the reduce operation. One of "sum", "mean", "max" or "min".
name (:obj:`str`):
Name for the operation. Currently not used
Returns:
(:obj:`IndexMap`): IndexMap of shape batch_shape with elements equal to range(num_segments).
"""
# Flatten the batch dimensions, as segments ops (scatter) do not support batching.
# However if `values` has extra dimensions to the right keep them
# unflattened. Segmented ops support vector-valued operations.
flat_index = flatten(index)
vector_shape = values.size()[len(index.indices.size()) :] # torch.Size object
flattened_shape = torch.cat(
[torch.as_tensor([-1], dtype=torch.long), torch.as_tensor(vector_shape, dtype=torch.long)], dim=0
)
# changed "view" by "reshape" in the following line
flat_values = values.reshape(flattened_shape.tolist())
segment_means = scatter(
src=flat_values,
index=flat_index.indices.type(torch.long),
dim=0,
dim_size=flat_index.num_segments,
reduce=segment_reduce_fn,
)
# Unflatten the values.
new_shape = torch.cat(
[
torch.as_tensor(index.batch_shape(), dtype=torch.long),
torch.as_tensor([index.num_segments], dtype=torch.long),
torch.as_tensor(vector_shape, dtype=torch.long),
],
dim=0,
)
output_values = segment_means.view(new_shape.tolist())
output_index = range_index_map(index.batch_shape(), index.num_segments)
return output_values, output_index
def reduce_sum(values, index, name="segmented_reduce_sum"):
"""
Sums a tensor over its segments.
Outputs 0 for empty segments.
This operations computes the sum over segments, with support for:
- Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices.
- Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be a sum of
vectors rather than scalars. Only the middle dimensions [I1, ..., Ik] are reduced by the operation.
Args:
values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]):
Tensor containing the values of which the sum must be taken segment-wise.
index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].):
Index defining the segments.
name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'):
Name for the operation. Currently not used
Returns:
output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the
output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. .
"""
return _segment_reduce(values, index, "sum", name)
def reduce_mean(values, index, name="segmented_reduce_mean"):
"""
Averages a tensor over its segments.
Outputs 0 for empty segments.
This operations computes the mean over segments, with support for:
- Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices.
- Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be a mean of
vectors rather than scalars.
Only the middle dimensions [I1, ..., Ik] are reduced by the operation.
Args:
values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]):
Tensor containing the values of which the mean must be taken segment-wise.
index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].):
Index defining the segments.
name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'):
Name for the operation. Currently not used
Returns:
output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the
output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments].
"""
return _segment_reduce(values, index, "mean", name)
def reduce_max(values, index, name="segmented_reduce_max"):
"""
Computes the maximum over segments.
This operation computes the maximum over segments, with support for:
- Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices.
- Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be an element-wise
maximum of vectors rather than scalars.
Only the middle dimensions [I1, ..., Ik] are reduced by the operation.
Args:
values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]):
Tensor containing the values of which the max must be taken segment-wise.
index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].):
Index defining the segments.
name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'):
Name for the operation. Currently not used
Returns:
output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the
output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments].
"""
return _segment_reduce(values, index, "max", name)
def reduce_min(values, index, name="segmented_reduce_min"):
"""
Computes the minimum over segments.
This operations computes the minimum over segments, with support for:
- Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices.
- Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be an element-wise
minimum of vectors rather than scalars.
Only the middle dimensions [I1, ..., Ik] are reduced by the operation.
Args:
values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]):
Tensor containing the values of which the min must be taken segment-wise.
index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].):
Index defining the segments.
name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'):
Name for the operation. Currently not used
Returns:
output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the
output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments].
"""
return _segment_reduce(values, index, "min", name)
# End of everything related to segmented tensors
def compute_column_logits(
sequence_output, column_output_weights, column_output_bias, cell_index, cell_mask, allow_empty_column_selection
):
"""
Computes the column logits.
Args:
sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model.
column_output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size)`):
Weights of the linear layer for column selection.
column_output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`):
Bias of the linear layer for column selection.
cell_index (:obj:`ProductIndexMap`):
Index that groups tokens into cells.
cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`):
Mask for cells that exist in the table (i.e. that are not padding).
allow_empty_column_selection (:obj:`bool`):
Whether to allow not to select any column
Returns:
column_logits (:obj:`torch.FloatTensor`of shape :obj:`(batch_size, max_num_cols)`): Tensor containing the
column logits for every example in the batch.
"""
# First, compute the token logits (batch_size, seq_len) - without temperature
token_logits = torch.einsum("bsj,j->bs", sequence_output, column_output_weights) + column_output_bias
# Next, average the logits per cell (batch_size, max_num_cols*max_num_rows)
cell_logits, cell_logits_index = reduce_mean(token_logits, cell_index)
# Finally, average the logits per column (batch_size, max_num_cols)
column_index = cell_index.project_inner(cell_logits_index)
column_logits, out_index = reduce_sum(cell_logits * cell_mask, column_index)
cell_count, _ = reduce_sum(cell_mask, column_index)
column_logits /= cell_count + EPSILON_ZERO_DIVISION
# Mask columns that do not appear in the example.
is_padding = torch.logical_and(cell_count < 0.5, ~torch.eq(out_index.indices, 0))
column_logits += CLOSE_ENOUGH_TO_LOG_ZERO * torch.as_tensor(
is_padding, dtype=torch.float32, device=is_padding.device
)
if not allow_empty_column_selection:
column_logits += CLOSE_ENOUGH_TO_LOG_ZERO * torch.as_tensor(
torch.eq(out_index.indices, 0), dtype=torch.float32, device=out_index.indices.device
)
return column_logits
def _single_column_cell_selection_loss(token_logits, column_logits, labels, cell_index, col_index, cell_mask):
"""
Computes the loss for cell selection constrained to a single column. The loss is a hierarchical log-likelihood. The
model first predicts a column and then selects cells within that column (conditioned on the column). Cells outside
the selected column are never selected.
Args:
token_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
Tensor containing the logits per token.
column_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_cols)`):
Tensor containing the logits per column.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Labels per token.
cell_index (:obj:`ProductIndexMap`):
Index that groups tokens into cells.
col_index (:obj:`IndexMap`):
Index that groups tokens into columns.
cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`):
Mask for cells that exist in the table (i.e. that are not padding).
Returns:
selection_loss_per_example (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Loss for each example.
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): New logits which are only
allowed to select cells in a single column. Logits outside of the most likely column according to
`column_logits` will be set to a very low value (such that the probabilities are 0).
"""
# Part 1: column loss
# First find the column we should select. We use the column with maximum number of selected cells.
labels_per_column, _ = reduce_sum(torch.as_tensor(labels, dtype=torch.float32, device=labels.device), col_index)
# shape of labels_per_column is (batch_size, max_num_cols). It contains the number of label ids for every column, for every example
column_label = torch.argmax(labels_per_column, dim=-1) # shape (batch_size,)
# Check if there are no selected cells in the column. In that case the model
# should predict the special column id 0, which means "select nothing".
no_cell_selected = torch.eq(
torch.max(labels_per_column, dim=-1)[0], 0
) # no_cell_selected is of shape (batch_size,) and equals True
# if an example of the batch has no cells selected (i.e. if there are no labels set to 1 for that example)
column_label = torch.where(
no_cell_selected.view(column_label.size()), torch.zeros_like(column_label), column_label
)
column_dist = torch.distributions.Categorical(logits=column_logits) # shape (batch_size, max_num_cols)
column_loss_per_example = -column_dist.log_prob(column_label)
# Part 2: cell loss
# Reduce the labels and logits to per-cell from per-token.
# logits_per_cell: shape (batch_size, max_num_rows*max_num_cols) i.e. (batch_size, 64*32)
logits_per_cell, _ = reduce_mean(token_logits, cell_index)
# labels_per_cell: shape (batch_size, 64*32), indicating whether each cell should be selected (1) or not (0)
labels_per_cell, labels_index = reduce_max(
torch.as_tensor(labels, dtype=torch.long, device=labels.device), cell_index
)
# Mask for the selected column.
# column_id_for_cells: shape (batch_size, 64*32), indicating to which column each cell belongs
column_id_for_cells = cell_index.project_inner(labels_index).indices
# column_mask: shape (batch_size, 64*32), equal to 1 if cell belongs to column to be selected
column_mask = torch.as_tensor(
torch.eq(column_id_for_cells, torch.unsqueeze(column_label, dim=-1)),
dtype=torch.float32,
device=cell_mask.device,
)
# Compute the log-likelihood for cells, but only for the selected column.
cell_dist = torch.distributions.Bernoulli(logits=logits_per_cell) # shape (batch_size, 64*32)
cell_log_prob = cell_dist.log_prob(labels_per_cell.type(torch.float32)) # shape(batch_size, 64*32)
cell_loss = -torch.sum(cell_log_prob * column_mask * cell_mask, dim=1)
# We need to normalize the loss by the number of cells in the column.
cell_loss /= torch.sum(column_mask * cell_mask, dim=1) + EPSILON_ZERO_DIVISION
selection_loss_per_example = column_loss_per_example
selection_loss_per_example += torch.where(
no_cell_selected.view(selection_loss_per_example.size()),
torch.zeros_like(selection_loss_per_example),
cell_loss,
)
# Set the probs outside the selected column (selected by the *model*)
# to 0. This ensures backwards compatibility with models that select
# cells from multiple columns.
selected_column_id = torch.as_tensor(
torch.argmax(column_logits, dim=-1), dtype=torch.long, device=column_logits.device
) # shape (batch_size,)
# selected_column_mask: shape (batch_size, 64*32), equal to 1 if cell belongs to column selected by the model
selected_column_mask = torch.as_tensor(
torch.eq(column_id_for_cells, torch.unsqueeze(selected_column_id, dim=-1)),
dtype=torch.float32,
device=selected_column_id.device,
)
# Never select cells with the special column id 0.
selected_column_mask = torch.where(
torch.eq(column_id_for_cells, 0).view(selected_column_mask.size()),
torch.zeros_like(selected_column_mask),
selected_column_mask,
)
new_logits_per_cell = logits_per_cell + CLOSE_ENOUGH_TO_LOG_ZERO * (1.0 - cell_mask * selected_column_mask)
logits = gather(new_logits_per_cell, cell_index)
return selection_loss_per_example, logits
def compute_token_logits(sequence_output, temperature, output_weights, output_bias):
"""
Computes logits per token
Args:
sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model.
temperature (:obj:`float`):
Temperature for the Bernoulli distribution.
output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size,)`):
Weights of the linear layer for cell selection.
output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`):
Bias of the linear layer for cell selection
Returns:
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): Logits per token.
"""
logits = (torch.einsum("bsj,j->bs", sequence_output, output_weights) + output_bias) / temperature
return logits
def _calculate_aggregate_mask(answer, pooled_output, cell_selection_preference, labels, aggregation_classifier):
"""
Finds examples where the model should select cells with no aggregation.
Returns a mask that determines for which examples should the model select answers directly from the table, without
any aggregation function. If the answer is a piece of text the case is unambiguous as aggregation functions only
apply to numbers. If the answer is a number but does not appear in the table then we must use some aggregation
case. The ambiguous case is when the answer is a number that also appears in the table. In this case we use the
aggregation function probabilities predicted by the model to decide whether to select or aggregate. The threshold
for this is a hyperparameter `cell_selection_preference
Args:
answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`):
Answer for every example in the batch. Nan if there is no scalar answer.
pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
Output of the pooler (BertPooler) on top of the encoder layer.
cell_selection_preference (:obj:`float`):
Preference for cell selection in ambiguous cases.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Labels per token. aggregation_classifier (:obj:`torch.nn.Linear`): Aggregation head
Returns:
aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A mask set to 1 for examples that
should use aggregation functions.
"""
# torch.FloatTensor(batch_size,)
aggregate_mask_init = torch.logical_not(torch.isnan(answer)).type(torch.FloatTensor).to(answer.device)
logits_aggregation = aggregation_classifier(pooled_output)
dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation)
# Index 0 correponds to "no aggregation".
aggregation_ops_total_mass = torch.sum(dist_aggregation.probs[:, 1:], dim=1)
# Cell selection examples according to current model.
is_pred_cell_selection = aggregation_ops_total_mass <= cell_selection_preference
# Examples with non-empty cell selection supervision.
is_cell_supervision_available = torch.sum(labels, dim=1) > 0
# torch.where is not equivalent to tf.where (in tensorflow 1)
# hence the added .view on the condition to match the shape of the first tensor
aggregate_mask = torch.where(
torch.logical_and(is_pred_cell_selection, is_cell_supervision_available).view(aggregate_mask_init.size()),
torch.zeros_like(aggregate_mask_init, dtype=torch.float32),
aggregate_mask_init,
)
aggregate_mask = aggregate_mask.detach()
return aggregate_mask
def _calculate_aggregation_loss_known(
logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels
):
"""
Calculates aggregation loss when its type is known during training.
In the weakly supervised setting, the only known information is that for cell selection examples, "no aggregation"
should be predicted. For other examples (those that require aggregation), no loss is accumulated. In the setting
where aggregation type is always known, standard cross entropy loss is accumulated for all examples
Args:
logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`):
Logits per aggregation operation.
aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`):
A mask set to 1 for examples that should use aggregation functions.
aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`):
Aggregation function id for every example in the batch.
use_answer_as_supervision (:obj:`bool`, `optional`):
Whether to use the answer as the only supervision for aggregation examples.
num_aggregation_labels (:obj:`int`, `optional`, defaults to 0):
The number of aggregation operators to predict.
Returns:
aggregation_loss_known (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (when its
type is known during training) per example.
"""
if use_answer_as_supervision:
# Prepare "no aggregation" targets for cell selection examples.
target_aggregation = torch.zeros_like(aggregate_mask, dtype=torch.long)
else:
# Use aggregation supervision as the target.
target_aggregation = aggregation_labels
one_hot_labels = torch.nn.functional.one_hot(target_aggregation, num_classes=num_aggregation_labels).type(
torch.float32
)
log_probs = torch.nn.functional.log_softmax(logits_aggregation, dim=-1)
# torch.FloatTensor[batch_size]
per_example_aggregation_intermediate = -torch.sum(one_hot_labels * log_probs, dim=-1)
if use_answer_as_supervision:
# Accumulate loss only for examples requiring cell selection
# (no aggregation).
return per_example_aggregation_intermediate * (1 - aggregate_mask)
else:
return per_example_aggregation_intermediate
def _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask):
"""
Calculates aggregation loss in the case of answer supervision.
Args:
logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`):
Logits per aggregation operation.
aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`):
A mask set to 1 for examples that should use aggregation functions
Returns:
aggregation_loss_unknown (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (in case of
answer supervision) per example.
"""
dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation)
# Index 0 correponds to "no aggregation".
aggregation_ops_total_mass = torch.sum(dist_aggregation.probs[:, 1:], dim=1)
# Predict some aggregation in case of an answer that needs aggregation.
# This increases the probability of all aggregation functions, in a way
# similar to MML, but without considering whether the function gives the
# correct answer.
return -torch.log(aggregation_ops_total_mass) * aggregate_mask
def _calculate_aggregation_loss(
logits_aggregation,
aggregate_mask,
aggregation_labels,
use_answer_as_supervision,
num_aggregation_labels,
aggregation_loss_weight,
):
"""
Calculates the aggregation loss per example.
Args:
logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`):
Logits per aggregation operation.
aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`):
A mask set to 1 for examples that should use aggregation functions.
aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`):
Aggregation function id for every example in the batch.
use_answer_as_supervision (:obj:`bool`, `optional`):
Whether to use the answer as the only supervision for aggregation examples.
num_aggregation_labels (:obj:`int`, `optional`, defaults to 0):
The number of aggregation operators to predict.
aggregation_loss_weight (:obj:`float`, `optional`, defaults to 1.0):
Importance weight for the aggregation loss.
Returns:
aggregation_loss (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss per example.
"""
per_example_aggregation_loss = _calculate_aggregation_loss_known(
logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels
)
if use_answer_as_supervision:
# Add aggregation loss for numeric answers that need aggregation.
per_example_aggregation_loss += _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask)
return aggregation_loss_weight * per_example_aggregation_loss
def _calculate_expected_result(
dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config
):
"""
Calculates the expected result given cell and aggregation probabilities.
Args:
dist_per_cell (:obj:`torch.distributions.Bernoulli`):
Cell selection distribution for each cell.
numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`):
Numeric values of every token. Nan for tokens which are not numeric values.
numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`):
Scale of the numeric values of every token.
input_mask_float (:obj: `torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`):
Mask for the table, without question tokens and table headers.
logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`):
Logits per aggregation operation.
config (:class:`~transformers.TapasConfig`):
Model configuration class with all the hyperparameters of the model
Returns:
expected_result (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): The expected result per example.
"""
if config.use_gumbel_for_cells:
gumbel_dist = torch.distributions.RelaxedBernoulli(
# The token logits where already divided by the temperature and used for
# computing cell selection errors so we need to multiply it again here
temperature=config.temperature,
logits=dist_per_cell.logits * config.temperature,
)
scaled_probability_per_cell = gumbel_dist.sample()
else:
scaled_probability_per_cell = dist_per_cell.probs
# <float32>[batch_size, seq_length]
scaled_probability_per_cell = (scaled_probability_per_cell / numeric_values_scale) * input_mask_float
count_result = torch.sum(scaled_probability_per_cell, dim=1)
numeric_values_masked = torch.where(
torch.isnan(numeric_values), torch.zeros_like(numeric_values), numeric_values
) # Mask non-numeric table values to zero.
sum_result = torch.sum(scaled_probability_per_cell * numeric_values_masked, dim=1)
avg_approximation = config.average_approximation_function
if avg_approximation == AverageApproximationFunction.RATIO:
average_result = sum_result / (count_result + EPSILON_ZERO_DIVISION)
elif avg_approximation == AverageApproximationFunction.FIRST_ORDER:
# The sum of all probabilities except that correspond to other cells
# Ex here stands for expectation, more explicitly the expectation of the sum of N-1 Bernoulli random variables plus
# the constant 1, which is computed as adding all N expected values and subtracting the extra one. It corresponds to X_c
# in Appendix D of the original TAPAS paper which is trying to approximate the average of a random set.
ex = torch.sum(scaled_probability_per_cell, dim=1, keepdim=True) - scaled_probability_per_cell + 1
average_result = torch.sum(numeric_values_masked * scaled_probability_per_cell / ex, dim=1)
elif avg_approximation == AverageApproximationFunction.SECOND_ORDER:
# The sum of all probabilities except that correspond to other cells
ex = torch.sum(scaled_probability_per_cell, dim=1, keepdim=True) - scaled_probability_per_cell + 1
pointwise_var = scaled_probability_per_cell * (1 - scaled_probability_per_cell)
var = torch.sum(pointwise_var, dim=1, keepdim=True) - pointwise_var
multiplier = (var / torch.square(ex) + 1) / ex
average_result = torch.sum(numeric_values_masked * scaled_probability_per_cell * multiplier, dim=1)
else:
raise ValueError(f"Invalid average_approximation_function: {config.average_approximation_function}")
if config.use_gumbel_for_aggregation:
gumbel_dist = torch.distributions.RelaxedOneHotCategorical(
config.aggregation_temperature, logits=logits_aggregation[:, 1:]
)
# <float32>[batch_size, num_aggregation_labels - 1]
aggregation_op_only_probs = gumbel_dist.sample()
else:
# <float32>[batch_size, num_aggregation_labels - 1]
aggregation_op_only_probs = torch.nn.functional.softmax(
logits_aggregation[:, 1:] / config.aggregation_temperature, dim=-1
)
all_results = torch.cat(
[
torch.unsqueeze(sum_result, dim=1),
torch.unsqueeze(average_result, dim=1),
torch.unsqueeze(count_result, dim=1),
],
dim=1,
)
expected_result = torch.sum(all_results * aggregation_op_only_probs, dim=1)
return expected_result
# PyTorch does not currently support Huber loss with custom delta so we define it ourself
def huber_loss(input, target, delta: float = 1.0):
errors = torch.abs(input - target) # shape (batch_size,)
return torch.where(errors < delta, 0.5 * errors ** 2, errors * delta - (0.5 * delta ** 2))
def _calculate_regression_loss(
answer,
aggregate_mask,
dist_per_cell,
numeric_values,
numeric_values_scale,
input_mask_float,
logits_aggregation,
config,
):
"""
Calculates the regression loss per example.
Args:
answer (:obj: `torch.FloatTensor` of shape :obj:`(batch_size,)`):
Answer for every example in the batch. Nan if there is no scalar answer.
aggregate_mask (:obj: `torch.FloatTensor` of shape :obj:`(batch_size,)`):
A mask set to 1 for examples that should use aggregation functions.
dist_per_cell (:obj:`torch.distributions.Bernoulli`):
Cell selection distribution for each cell.
numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`):
Numeric values of every token. Nan for tokens which are not numeric values.
numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`):
Scale of the numeric values of every token.
input_mask_float (:obj: `torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`):
Mask for the table, without question tokens and table headers.
logits_aggregation (:obj: `torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`):
Logits per aggregation operation.
config (:class:`~transformers.TapasConfig`):
Model configuration class with all the parameters of the model
Returns:
per_example_answer_loss_scaled (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Scales answer loss for
each example in the batch. large_answer_loss_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A
mask which is 1 for examples for which their answer loss is larger than the answer_loss_cutoff.
"""
# float32 (batch_size,)
expected_result = _calculate_expected_result(
dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config
)
# float32 (batch_size,)
answer_masked = torch.where(torch.isnan(answer), torch.zeros_like(answer), answer)
if config.use_normalized_answer_loss:
normalizer = (torch.max(torch.abs(expected_result), torch.abs(answer_masked)) + EPSILON_ZERO_DIVISION).detach()
normalized_answer_masked = answer_masked / normalizer
normalized_expected_result = expected_result / normalizer
per_example_answer_loss = huber_loss(
normalized_expected_result * aggregate_mask, normalized_answer_masked * aggregate_mask
)
else:
per_example_answer_loss = huber_loss(
expected_result * aggregate_mask, answer_masked * aggregate_mask, delta=config.huber_loss_delta
)
if config.answer_loss_cutoff is None:
large_answer_loss_mask = torch.ones_like(per_example_answer_loss, dtype=torch.float32)
else:
large_answer_loss_mask = torch.where(
per_example_answer_loss > config.answer_loss_cutoff,
torch.zeros_like(per_example_answer_loss, dtype=torch.float32),
torch.ones_like(per_example_answer_loss, dtype=torch.float32),
)
per_example_answer_loss_scaled = config.answer_loss_importance * (per_example_answer_loss * aggregate_mask)
return per_example_answer_loss_scaled, large_answer_loss_mask
# coding=utf-8
# Copyright 2020 Google Research and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization class for TAPAS model."""
import collections
import datetime
import enum
import itertools
import math
import os
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, Dict, Generator, List, Optional, Text, Tuple, Union
import numpy as np
from transformers import add_end_docstrings
from ...file_utils import is_pandas_available
from ...tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
from ...tokenization_utils_base import (
ENCODE_KWARGS_DOCSTRING,
BatchEncoding,
EncodedInput,
ExplicitEnum,
PaddingStrategy,
PreTokenizedInput,
TensorType,
TextInput,
)
from ...utils import logging
if is_pandas_available():
import pandas as pd
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"google/tapas-base-finetuned-sqa": "https://huggingface.co/google/tapas-base-finetuned-sqa/resolve/main/vocab.txt",
"google/tapas-base-finetuned-wtq": "https://huggingface.co/google/tapas-base-finetuned-wtq/resolve/main/vocab.txt",
"google/tapas-base-finetuned-wikisql-supervised": "https://huggingface.co/google/tapas-base-finetuned-wikisql-supervised/resolve/main/vocab.txt",
"google/tapas-base-finetuned-tabfact": "https://huggingface.co/google/tapas-base-finetuned-tabfact/resolve/main/vocab.txt",
}
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
# large models
"google/tapas-large-finetuned-sqa": 512,
"google/tapas-large-finetuned-wtq": 512,
"google/tapas-large-finetuned-wikisql-supervised": 512,
"google/tapas-large-finetuned-tabfact": 512,
# base models
"google/tapas-base-finetuned-sqa": 512,
"google/tapas-base-finetuned-wtq": 512,
"google/tapas-base-finetuned-wikisql-supervised": 512,
"google/tapas-base-finetuned-tabfact": 512,
# medium models
"google/tapas-medium-finetuned-sqa": 512,
"google/tapas-medium-finetuned-wtq": 512,
"google/tapas-medium-finetuned-wikisql-supervised": 512,
"google/tapas-medium-finetuned-tabfact": 512,
# small models
"google/tapas-base-finetuned-sqa": 512,
"google/tapas-base-finetuned-wtq": 512,
"google/tapas-base-finetuned-wikisql-supervised": 512,
"google/tapas-base-finetuned-tabfact": 512,
# tiny models
"google/tapas-tiny-finetuned-sqa": 512,
"google/tapas-tiny-finetuned-wtq": 512,
"google/tapas-tiny-finetuned-wikisql-supervised": 512,
"google/tapas-tiny-finetuned-tabfact": 512,
# mini models
"google/tapas-mini-finetuned-sqa": 512,
"google/tapas-mini-finetuned-wtq": 512,
"google/tapas-mini-finetuned-wikisql-supervised": 512,
"google/tapas-mini-finetuned-tabfact": 512,
}
PRETRAINED_INIT_CONFIGURATION = {
"google/tapas-base-finetuned-sqa": {"do_lower_case": True},
"google/tapas-base-finetuned-wtq": {"do_lower_case": True},
"google/tapas-base-finetuned-wikisql-supervised": {"do_lower_case": True},
"google/tapas-base-finetuned-tabfact": {"do_lower_case": True},
}
class TapasTruncationStrategy(ExplicitEnum):
"""
Possible values for the ``truncation`` argument in :meth:`~transformers.TapasTokenizer.__call__`. Useful for
tab-completion in an IDE.
"""
DROP_ROWS_TO_FIT = "drop_rows_to_fit"
DO_NOT_TRUNCATE = "do_not_truncate"
TableValue = collections.namedtuple("TokenValue", ["token", "column_id", "row_id"])
@dataclass(frozen=True)
class TokenCoordinates:
column_index: int
row_index: int
token_index: int
@dataclass
class TokenizedTable:
rows: List[List[List[Text]]]
selected_tokens: List[TokenCoordinates]
@dataclass(frozen=True)
class SerializedExample:
tokens: List[Text]
column_ids: List[int]
row_ids: List[int]
segment_ids: List[int]
def _is_inner_wordpiece(token: Text):
return token.startswith("##")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
with open(vocab_file, "r", encoding="utf-8") as reader:
tokens = reader.readlines()
for index, token in enumerate(tokens):
token = token.rstrip("\n")
vocab[token] = index
return vocab
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING = r"""
add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to encode the sequences with the special tokens relative to their model.
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
Activates and controls padding. Accepts the following values:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
single sequence if provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.TapasTruncationStrategy`, `optional`, defaults to :obj:`False`):
Activates and controls truncation. Accepts the following values:
* :obj:`True` or :obj:`'drop_rows_to_fit'`: Truncate to a maximum length specified with the argument
:obj:`max_length` or to the maximum acceptable input length for the model if that argument is not
provided. This will truncate row by row, removing rows from the table.
* :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with
sequence lengths greater than the model maximum admissible input size).
max_length (:obj:`int`, `optional`):
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
length is required by one of the truncation/padding parameters. If the model has no specific maximum
input length (like XLNet) truncation/padding to a maximum length will be deactivated.
is_split_into_words (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer
will skip the pre-tokenization step. This is useful for NER or token classification.
pad_to_multiple_of (:obj:`int`, `optional`):
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`):
If set, will return tensors instead of list of python integers. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
"""
class TapasTokenizer(PreTrainedTokenizer):
r"""
Construct a TAPAS tokenizer. Based on WordPiece. Flattens a table and one or more related sentences to be used by
TAPAS models.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
:class:`~transformers.TapasTokenizer` creates several token type ids to encode tabular structure. To be more
precise, it adds 7 token type ids, in the following order: :obj:`segment_ids`, :obj:`column_ids`, :obj:`row_ids`,
:obj:`prev_labels`, :obj:`column_ranks`, :obj:`inv_column_ranks` and :obj:`numeric_relations`:
- segment_ids: indicate whether a token belongs to the question (0) or the table (1). 0 for special tokens and
padding.
- column_ids: indicate to which column of the table a token belongs (starting from 1). Is 0 for all question
tokens, special tokens and padding.
- row_ids: indicate to which row of the table a token belongs (starting from 1). Is 0 for all question tokens,
special tokens and padding. Tokens of column headers are also 0.
- prev_labels: indicate whether a token was (part of) an answer to the previous question (1) or not (0). Useful in
a conversational setup (such as SQA).
- column_ranks: indicate the rank of a table token relative to a column, if applicable. For example, if you have a
column "number of movies" with values 87, 53 and 69, then the column ranks of these tokens are 3, 1 and 2
respectively. 0 for all question tokens, special tokens and padding.
- inv_column_ranks: indicate the inverse rank of a table token relative to a column, if applicable. For example, if
you have a column "number of movies" with values 87, 53 and 69, then the inverse column ranks of these tokens are
1, 3 and 2 respectively. 0 for all question tokens, special tokens and padding.
- numeric_relations: indicate numeric relations between the question and the tokens of the table. 0 for all
question tokens, special tokens and padding.
:class:`~transformers.TapasTokenizer` runs end-to-end tokenization on a table and associated sentences: punctuation
splitting and wordpiece.
Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to do basic tokenization before WordPiece.
never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
empty_token (:obj:`str`, `optional`, defaults to :obj:`"[EMPTY]"`):
The token used for empty cell values in a table. Empty cell values include "", "n/a", "nan" and "?".
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this
`issue <https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
cell_trim_length (:obj:`int`, `optional`, defaults to -1):
If > 0: Trim cells so that the length is <= this value. Also disables further cell trimming, should thus be
used with 'drop_rows_to_fit' below.
max_column_id (:obj:`int`, `optional`):
Max column id to extract.
max_row_id (:obj:`int`, `optional`):
Max row id to extract.
strip_column_names (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add empty strings instead of column names.
update_answer_coordinates (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to recompute the answer coordinates from the answer text.
drop_rows_to_fit (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to drop the last rows if a table doesn't fit within max sequence length.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
def __init__(
self,
vocab_file,
do_lower_case=True,
do_basic_tokenize=True,
never_split=None,
unk_token="[UNK]",
sep_token="[SEP]",
pad_token="[PAD]",
cls_token="[CLS]",
mask_token="[MASK]",
empty_token="[EMPTY]",
tokenize_chinese_chars=True,
strip_accents=None,
cell_trim_length: int = -1,
max_column_id: int = None,
max_row_id: int = None,
strip_column_names: bool = False,
update_answer_coordinates: bool = False,
drop_rows_to_fit: bool = False,
model_max_length: int = 512,
additional_special_tokens: Optional[List[str]] = None,
**kwargs
):
if not is_pandas_available():
raise ImportError("Pandas is required for the TAPAS tokenizer.")
if additional_special_tokens is not None:
if empty_token not in additional_special_tokens:
additional_special_tokens.append(empty_token)
else:
additional_special_tokens = [empty_token]
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
empty_token=empty_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
cell_trim_length=cell_trim_length,
max_column_id=max_column_id,
max_row_id=max_row_id,
strip_column_names=strip_column_names,
update_answer_coordinates=update_answer_coordinates,
drop_rows_to_fit=drop_rows_to_fit,
model_max_length=model_max_length,
additional_special_tokens=additional_special_tokens,
**kwargs,
)
if not os.path.isfile(vocab_file):
raise ValueError(
"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
"model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
)
self.vocab = load_vocab(vocab_file)
self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
self.do_basic_tokenize = do_basic_tokenize
if do_basic_tokenize:
self.basic_tokenizer = BasicTokenizer(
do_lower_case=do_lower_case,
never_split=never_split,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
# Additional properties
self.cell_trim_length = cell_trim_length
self.max_column_id = max_column_id if max_column_id is not None else self.model_max_length
self.max_row_id = max_row_id if max_row_id is not None else self.model_max_length
self.strip_column_names = strip_column_names
self.update_answer_coordinates = update_answer_coordinates
self.drop_rows_to_fit = drop_rows_to_fit
@property
def do_lower_case(self):
return self.basic_tokenizer.do_lower_case
@property
def vocab_size(self):
return len(self.vocab)
def get_vocab(self):
return dict(self.vocab, **self.added_tokens_encoder)
def _tokenize(self, text):
if format_text(text) == EMPTY_TEXT:
return [self.additional_special_tokens[0]]
split_tokens = []
if self.do_basic_tokenize:
for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
# If the token is part of the never_split set
if token in self.basic_tokenizer.never_split:
split_tokens.append(token)
else:
split_tokens += self.wordpiece_tokenizer.tokenize(token)
else:
split_tokens = self.wordpiece_tokenizer.tokenize(text)
return split_tokens
def _convert_token_to_id(self, token):
""" Converts a token (str) in an id using the vocab. """
return self.vocab.get(token, self.vocab.get(self.unk_token))
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
return self.ids_to_tokens.get(index, self.unk_token)
def convert_tokens_to_string(self, tokens):
""" Converts a sequence of tokens (string) in a single string. """
out_string = " ".join(tokens).replace(" ##", "").strip()
return out_string
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
index = 0
if os.path.isdir(save_directory):
vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
else:
vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
with open(vocab_file, "w", encoding="utf-8") as writer:
for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning(
f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
" Please check that the vocabulary is not corrupted!"
)
index = token_index
writer.write(token + "\n")
index += 1
return (vocab_file,)
def create_attention_mask_from_sequences(self, query_ids: List[int], table_values: List[TableValue]) -> List[int]:
"""
Creates the attention mask according to the query token IDs and a list of table values.
Args:
query_ids (:obj:`List[int]`): list of token IDs corresponding to the ID.
table_values (:obj:`List[TableValue]`): lift of table values, which are named tuples containing the
token value, the column ID and the row ID of said token.
Returns:
:obj:`List[int]`: List of ints containing the attention mask values.
"""
return [1] * (1 + len(query_ids) + 1 + len(table_values))
def create_segment_token_type_ids_from_sequences(
self, query_ids: List[int], table_values: List[TableValue]
) -> List[int]:
"""
Creates the segment token type IDs according to the query token IDs and a list of table values.
Args:
query_ids (:obj:`List[int]`): list of token IDs corresponding to the ID.
table_values (:obj:`List[TableValue]`): lift of table values, which are named tuples containing the
token value, the column ID and the row ID of said token.
Returns:
:obj:`List[int]`: List of ints containing the segment token type IDs values.
"""
table_ids = list(zip(*table_values))[0] if table_values else []
return [0] * (1 + len(query_ids) + 1) + [1] * len(table_ids)
def create_column_token_type_ids_from_sequences(
self, query_ids: List[int], table_values: List[TableValue]
) -> List[int]:
"""
Creates the column token type IDs according to the query token IDs and a list of table values.
Args:
query_ids (:obj:`List[int]`): list of token IDs corresponding to the ID.
table_values (:obj:`List[TableValue]`): lift of table values, which are named tuples containing the
token value, the column ID and the row ID of said token.
Returns:
:obj:`List[int]`: List of ints containing the column token type IDs values.
"""
table_column_ids = list(zip(*table_values))[1] if table_values else []
return [0] * (1 + len(query_ids) + 1) + list(table_column_ids)
def create_row_token_type_ids_from_sequences(
self, query_ids: List[int], table_values: List[TableValue]
) -> List[int]:
"""
Creates the row token type IDs according to the query token IDs and a list of table values.
Args:
query_ids (:obj:`List[int]`): list of token IDs corresponding to the ID.
table_values (:obj:`List[TableValue]`): lift of table values, which are named tuples containing the
token value, the column ID and the row ID of said token.
Returns:
:obj:`List[int]`: List of ints containing the row token type IDs values.
"""
table_row_ids = list(zip(*table_values))[2] if table_values else []
return [0] * (1 + len(query_ids) + 1) + list(table_row_ids)
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a question and flattened table for question answering or sequence classification tasks
by concatenating and adding special tokens.
Args:
token_ids_0 (:obj:`List[int]`): The ids of the question.
token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table.
Returns:
:obj:`List[int]`: The model input with special tokens.
"""
if token_ids_1 is None:
raise ValueError("With TAPAS, you must provide both question IDs and table IDs.")
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of question IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
List of flattened table IDs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
if token_ids_1 is not None:
raise ValueError(
"You should not supply a second sequence if the provided sequence of "
"ids is already formatted with special tokens for the model."
)
return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
return [1] + ([0] * len(token_ids_0)) + [1]
@add_end_docstrings(TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def __call__(
self,
table: "pd.DataFrame",
queries: Optional[
Union[
TextInput,
PreTokenizedInput,
EncodedInput,
List[TextInput],
List[PreTokenizedInput],
List[EncodedInput],
]
] = None,
answer_coordinates: Optional[Union[List[Tuple], List[List[Tuple]]]] = None,
answer_text: Optional[Union[List[TextInput], List[List[TextInput]]]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
) -> BatchEncoding:
"""
Main method to tokenize and prepare for the model one or several sequence(s) related to a table.
Args:
table (:obj:`pd.DataFrame`):
Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
dataframe to convert it to string.
queries (:obj:`str` or :obj:`List[str]`):
Question or batch of questions related to a table to be encoded. Note that in case of a batch, all
questions must refer to the **same** table.
answer_coordinates (:obj:`List[Tuple]` or :obj:`List[List[Tuple]]`, `optional`):
Answer coordinates of each table-question pair in the batch. In case only a single table-question pair
is provided, then the answer_coordinates must be a single list of one or more tuples. Each tuple must
be a (row_index, column_index) pair. The first data row (not the column header row) has index 0. The
first column has index 0. In case a batch of table-question pairs is provided, then the
answer_coordinates must be a list of lists of tuples (each list corresponding to a single
table-question pair).
answer_text (:obj:`List[str]` or :obj:`List[List[str]]`, `optional`):
Answer text of each table-question pair in the batch. In case only a single table-question pair is
provided, then the answer_text must be a single list of one or more strings. Each string must be the
answer text of a corresponding answer coordinate. In case a batch of table-question pairs is provided,
then the answer_coordinates must be a list of lists of strings (each list corresponding to a single
table-question pair).
"""
assert isinstance(table, pd.DataFrame), "Table must be of type pd.DataFrame"
# Input type checking for clearer error
valid_query = False
# Check that query has a valid type
if queries is None or isinstance(queries, str):
valid_query = True
elif isinstance(queries, (list, tuple)):
if len(queries) == 0 or isinstance(queries[0], str):
valid_query = True
if not valid_query:
raise ValueError(
"queries input must of type `str` (single example), `List[str]` (batch or single pretokenized example). "
)
is_batched = isinstance(queries, (list, tuple))
if is_batched:
return self.batch_encode_plus(
table=table,
queries=queries,
answer_coordinates=answer_coordinates,
answer_text=answer_text,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
**kwargs,
)
else:
return self.encode_plus(
table=table,
query=queries,
answer_coordinates=answer_coordinates,
answer_text=answer_text,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
**kwargs,
)
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def batch_encode_plus(
self,
table: "pd.DataFrame",
queries: Optional[
Union[
List[TextInput],
List[PreTokenizedInput],
List[EncodedInput],
]
] = None,
answer_coordinates: Optional[List[List[Tuple]]] = None,
answer_text: Optional[List[List[TextInput]]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
) -> BatchEncoding:
"""
Prepare a table and a list of strings for the model.
.. warning::
This method is deprecated, ``__call__`` should be used instead.
Args:
table (:obj:`pd.DataFrame`):
Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
dataframe to convert it to string.
queries (:obj:`List[str]`):
Batch of questions related to a table to be encoded. Note that all questions must refer to the **same**
table.
answer_coordinates (:obj:`List[Tuple]` or :obj:`List[List[Tuple]]`, `optional`):
Answer coordinates of each table-question pair in the batch. Each tuple must be a (row_index,
column_index) pair. The first data row (not the column header row) has index 0. The first column has
index 0. The answer_coordinates must be a list of lists of tuples (each list corresponding to a single
table-question pair).
answer_text (:obj:`List[str]` or :obj:`List[List[str]]`, `optional`):
Answer text of each table-question pair in the batch. In case a batch of table-question pairs is
provided, then the answer_coordinates must be a list of lists of strings (each list corresponding to a
single table-question pair). Each string must be the answer text of a corresponding answer coordinate.
"""
if return_token_type_ids is not None and not add_special_tokens:
raise ValueError(
"Asking to return token_type_ids while setting add_special_tokens to False "
"results in an undefined behavior. Please set add_special_tokens to True or "
"set return_token_type_ids to None."
)
if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text):
raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided")
elif answer_coordinates is None and answer_text is None:
answer_coordinates = answer_text = [None] * len(queries)
if "is_split_into_words" in kwargs:
raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.")
if return_offsets_mapping:
raise NotImplementedError(
"return_offset_mapping is not available when using Python tokenizers."
"To use this feature, change your tokenizer to one deriving from "
"transformers.PreTrainedTokenizerFast."
)
return self._batch_encode_plus(
table=table,
queries=queries,
answer_coordinates=answer_coordinates,
answer_text=answer_text,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
**kwargs,
)
def _batch_encode_plus(
self,
table,
queries: Union[
List[TextInput],
List[PreTokenizedInput],
List[EncodedInput],
],
answer_coordinates: Optional[List[List[Tuple]]] = None,
answer_text: Optional[List[List[TextInput]]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = True,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
) -> BatchEncoding:
table_tokens = self._tokenize_table(table)
queries_tokens = []
for query in queries:
query_tokens = self.tokenize(query)
queries_tokens.append(query_tokens)
batch_outputs = self._batch_prepare_for_model(
table,
queries,
tokenized_table=table_tokens,
queries_tokens=queries_tokens,
answer_coordinates=answer_coordinates,
padding=padding,
truncation=truncation,
answer_text=answer_text,
add_special_tokens=add_special_tokens,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
prepend_batch_axis=True,
return_attention_mask=return_attention_mask,
return_token_type_ids=return_token_type_ids,
return_overflowing_tokens=return_overflowing_tokens,
return_special_tokens_mask=return_special_tokens_mask,
return_length=return_length,
verbose=verbose,
)
return BatchEncoding(batch_outputs)
def _batch_prepare_for_model(
self,
raw_table: "pd.DataFrame",
raw_queries: Union[
List[TextInput],
List[PreTokenizedInput],
List[EncodedInput],
],
tokenized_table: Optional[TokenizedTable] = None,
queries_tokens: Optional[List[List[str]]] = None,
answer_coordinates: Optional[List[List[Tuple]]] = None,
answer_text: Optional[List[List[TextInput]]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = True,
return_attention_mask: Optional[bool] = True,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
prepend_batch_axis: bool = False,
**kwargs
) -> BatchEncoding:
batch_outputs = {}
for index, example in enumerate(zip(raw_queries, queries_tokens, answer_coordinates, answer_text)):
raw_query, query_tokens, answer_coords, answer_txt = example
outputs = self.prepare_for_model(
raw_table,
raw_query,
tokenized_table=tokenized_table,
query_tokens=query_tokens,
answer_coordinates=answer_coords,
answer_text=answer_txt,
add_special_tokens=add_special_tokens,
padding=PaddingStrategy.DO_NOT_PAD.value, # we pad in batch afterwards
truncation=truncation,
max_length=max_length,
pad_to_multiple_of=None, # we pad in batch afterwards
return_attention_mask=False, # we pad in batch afterwards
return_token_type_ids=return_token_type_ids,
return_special_tokens_mask=return_special_tokens_mask,
return_length=return_length,
return_tensors=None, # We convert the whole batch to tensors at the end
prepend_batch_axis=False,
verbose=verbose,
prev_answer_coordinates=answer_coordinates[index - 1] if index != 0 else None,
prev_answer_text=answer_text[index - 1] if index != 0 else None,
)
for key, value in outputs.items():
if key not in batch_outputs:
batch_outputs[key] = []
batch_outputs[key].append(value)
batch_outputs = self.pad(
batch_outputs,
padding=padding,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
)
batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
return batch_outputs
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING)
def encode(
self,
table: "pd.DataFrame",
query: Optional[
Union[
TextInput,
PreTokenizedInput,
EncodedInput,
]
] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
) -> List[int]:
"""
Prepare a table and a string for the model. This method does not return token type IDs, attention masks, etc.
which are necessary for the model to work correctly. Use that method if you want to build your processing on
your own, otherwise refer to ``__call__``.
Args:
table (:obj:`pd.DataFrame`):
Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
dataframe to convert it to string.
query (:obj:`str` or :obj:`List[str]`):
Question related to a table to be encoded.
"""
encoded_inputs = self.encode_plus(
table,
query=query,
add_special_tokens=add_special_tokens,
padding=padding,
truncation=truncation,
max_length=max_length,
return_tensors=return_tensors,
**kwargs,
)
return encoded_inputs["input_ids"]
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def encode_plus(
self,
table: "pd.DataFrame",
query: Optional[
Union[
TextInput,
PreTokenizedInput,
EncodedInput,
]
] = None,
answer_coordinates: Optional[List[Tuple]] = None,
answer_text: Optional[List[TextInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
) -> BatchEncoding:
"""
Prepare a table and a string for the model.
Args:
table (:obj:`pd.DataFrame`):
Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
dataframe to convert it to string.
query (:obj:`str` or :obj:`List[str]`):
Question related to a table to be encoded.
answer_coordinates (:obj:`List[Tuple]` or :obj:`List[List[Tuple]]`, `optional`):
Answer coordinates of each table-question pair in the batch. The answer_coordinates must be a single
list of one or more tuples. Each tuple must be a (row_index, column_index) pair. The first data row
(not the column header row) has index 0. The first column has index 0.
answer_text (:obj:`List[str]` or :obj:`List[List[str]]`, `optional`):
Answer text of each table-question pair in the batch. The answer_text must be a single list of one or
more strings. Each string must be the answer text of a corresponding answer coordinate.
"""
if return_token_type_ids is not None and not add_special_tokens:
raise ValueError(
"Asking to return token_type_ids while setting add_special_tokens to False "
"results in an undefined behavior. Please set add_special_tokens to True or "
"set return_token_type_ids to None."
)
if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text):
raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided")
if "is_split_into_words" in kwargs:
raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.")
if return_offsets_mapping:
raise NotImplementedError(
"return_offset_mapping is not available when using Python tokenizers."
"To use this feature, change your tokenizer to one deriving from "
"transformers.PreTrainedTokenizerFast."
)
return self._encode_plus(
table=table,
query=query,
answer_coordinates=answer_coordinates,
answer_text=answer_text,
add_special_tokens=add_special_tokens,
truncation=truncation,
padding=padding,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
return_token_type_ids=return_token_type_ids,
return_attention_mask=return_attention_mask,
return_special_tokens_mask=return_special_tokens_mask,
return_offsets_mapping=return_offsets_mapping,
return_length=return_length,
verbose=verbose,
**kwargs,
)
def _encode_plus(
self,
table: "pd.DataFrame",
query: Union[
TextInput,
PreTokenizedInput,
EncodedInput,
],
answer_coordinates: Optional[List[Tuple]] = None,
answer_text: Optional[List[TextInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = True,
return_attention_mask: Optional[bool] = True,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
):
if query is None:
query = ""
logger.warning(
"TAPAS is a question answering model but you have not passed a query. Please be aware that the "
"model will probably not behave correctly."
)
table_tokens = self._tokenize_table(table)
query_tokens = self.tokenize(query)
return self.prepare_for_model(
table,
query,
tokenized_table=table_tokens,
query_tokens=query_tokens,
answer_coordinates=answer_coordinates,
answer_text=answer_text,
add_special_tokens=add_special_tokens,
truncation=truncation,
padding=padding,
max_length=max_length,
pad_to_multiple_of=pad_to_multiple_of,
return_tensors=return_tensors,
prepend_batch_axis=True,
return_attention_mask=return_attention_mask,
return_token_type_ids=return_token_type_ids,
return_special_tokens_mask=return_special_tokens_mask,
return_length=return_length,
verbose=verbose,
)
@add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
def prepare_for_model(
self,
raw_table: "pd.DataFrame",
raw_query: Union[
TextInput,
PreTokenizedInput,
EncodedInput,
],
tokenized_table: Optional[TokenizedTable] = None,
query_tokens: Optional[TokenizedTable] = None,
answer_coordinates: Optional[List[Tuple]] = None,
answer_text: Optional[List[TextInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TapasTruncationStrategy] = False,
max_length: Optional[int] = None,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = True,
return_attention_mask: Optional[bool] = True,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
prepend_batch_axis: bool = False,
**kwargs
) -> BatchEncoding:
"""
Prepares a sequence of input id so that it can be used by the model. It adds special tokens, truncates
sequences if overflowing while taking into account the special tokens.
Args:
raw_table (:obj:`pd.DataFrame`):
The original table before any transformation (like tokenization) was applied to it.
raw_query (:obj:`TextInput` or :obj:`PreTokenizedInput` or :obj:`EncodedInput`):
The original query before any transformation (like tokenization) was applied to it.
tokenized_table (:obj:`TokenizedTable`):
The table after tokenization.
query_tokens (:obj:`List[str]`):
The query after tokenization.
answer_coordinates (:obj:`List[Tuple]` or :obj:`List[List[Tuple]]`, `optional`):
Answer coordinates of each table-question pair in the batch. The answer_coordinates must be a single
list of one or more tuples. Each tuple must be a (row_index, column_index) pair. The first data row
(not the column header row) has index 0. The first column has index 0.
answer_text (:obj:`List[str]` or :obj:`List[List[str]]`, `optional`):
Answer text of each table-question pair in the batch. The answer_text must be a single list of one or
more strings. Each string must be the answer text of a corresponding answer coordinate.
"""
if isinstance(padding, bool):
if padding and (max_length is not None or pad_to_multiple_of is not None):
padding = PaddingStrategy.MAX_LENGTH
else:
padding = PaddingStrategy.DO_NOT_PAD
elif not isinstance(padding, PaddingStrategy):
padding = PaddingStrategy(padding)
if isinstance(truncation, bool):
if truncation:
truncation = TapasTruncationStrategy.DROP_ROWS_TO_FIT
else:
truncation = TapasTruncationStrategy.DO_NOT_TRUNCATE
elif not isinstance(truncation, TapasTruncationStrategy):
truncation = TapasTruncationStrategy(truncation)
encoded_inputs = {}
is_part_of_batch = False
prev_answer_coordinates, prev_answer_text = None, None
if "prev_answer_coordinates" in kwargs and "prev_answer_text" in kwargs:
is_part_of_batch = True
prev_answer_coordinates = kwargs["prev_answer_coordinates"]
prev_answer_text = kwargs["prev_answer_text"]
num_rows = self._get_num_rows(raw_table, self.drop_rows_to_fit)
num_columns = self._get_num_columns(raw_table)
_, _, num_tokens = self._get_table_boundaries(tokenized_table)
if truncation != TapasTruncationStrategy.DO_NOT_TRUNCATE:
num_rows, num_tokens = self._get_truncated_table_rows(
query_tokens, tokenized_table, num_rows, num_columns, max_length, truncation_strategy=truncation
)
table_data = list(self._get_table_values(tokenized_table, num_columns, num_rows, num_tokens))
query_ids = self.convert_tokens_to_ids(query_tokens)
table_ids = list(zip(*table_data))[0] if len(table_data) > 0 else list(zip(*table_data))
table_ids = self.convert_tokens_to_ids(list(table_ids))
if "return_overflowing_tokens" in kwargs and kwargs["return_overflowing_tokens"]:
raise ValueError("TAPAS does not return overflowing tokens as it works on tables.")
if add_special_tokens:
input_ids = self.build_inputs_with_special_tokens(query_ids, table_ids)
else:
input_ids = query_ids + table_ids
if max_length is not None and len(input_ids) > max_length:
raise ValueError(
"Could not encode the query and table header given the maximum length. Encoding the query and table"
f"header results in a length of {len(input_ids)} which is higher than the max_length of {max_length}"
)
encoded_inputs["input_ids"] = input_ids
segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data)
column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data)
row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data)
if not is_part_of_batch or (prev_answer_coordinates is None and prev_answer_text is None):
# simply set the prev_labels to zeros
prev_labels = [0] * len(row_ids)
else:
prev_labels = self.get_answer_ids(
column_ids, row_ids, table_data, prev_answer_text, prev_answer_coordinates
)
# FIRST: parse both the table and question in terms of numeric values
raw_table = add_numeric_table_values(raw_table)
raw_query = add_numeric_values_to_question(raw_query)
# SECOND: add numeric-related features (and not parse them in these functions):
column_ranks, inv_column_ranks = self._get_numeric_column_ranks(column_ids, row_ids, raw_table)
numeric_relations = self._get_numeric_relations(raw_query, column_ids, row_ids, raw_table)
# Load from model defaults
if return_token_type_ids is None:
return_token_type_ids = "token_type_ids" in self.model_input_names
if return_attention_mask is None:
return_attention_mask = "attention_mask" in self.model_input_names
if return_attention_mask:
attention_mask = self.create_attention_mask_from_sequences(query_ids, table_data)
encoded_inputs["attention_mask"] = attention_mask
if answer_coordinates is not None and answer_text is not None:
labels = self.get_answer_ids(column_ids, row_ids, table_data, answer_text, answer_coordinates)
numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids)
numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids)
encoded_inputs["labels"] = labels
encoded_inputs["numeric_values"] = numeric_values
encoded_inputs["numeric_values_scale"] = numeric_values_scale
if return_token_type_ids:
token_type_ids = [
segment_ids,
column_ids,
row_ids,
prev_labels,
column_ranks,
inv_column_ranks,
numeric_relations,
]
token_type_ids = [list(ids) for ids in list(zip(*token_type_ids))]
encoded_inputs["token_type_ids"] = token_type_ids
if return_special_tokens_mask:
if add_special_tokens:
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(query_ids, table_ids)
else:
encoded_inputs["special_tokens_mask"] = [0] * len(input_ids)
# Check lengths
if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
logger.warning(
"Token indices sequence length is longer than the specified maximum sequence length "
"for this model ({} > {}). Running this sequence through the model will result in "
"indexing errors".format(len(encoded_inputs["input_ids"]), self.model_max_length)
)
self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
# Padding
if padding != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
encoded_inputs = self.pad(
encoded_inputs,
max_length=max_length,
padding=padding.value,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask,
)
if return_length:
encoded_inputs["length"] = len(encoded_inputs["input_ids"])
batch_outputs = BatchEncoding(
encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
)
return batch_outputs
def _get_truncated_table_rows(
self,
query_tokens: List[str],
tokenized_table: TokenizedTable,
num_rows: int,
num_columns: int,
max_length: int,
truncation_strategy: Union[str, TapasTruncationStrategy],
) -> Tuple[int, int]:
"""
Truncates a sequence pair in-place following the strategy.
Args:
query_tokens (:obj:`List[str]`):
List of strings corresponding to the tokenized query.
tokenized_table (:obj:`TokenizedTable`):
Tokenized table
num_rows (:obj:`int`):
Total number of table rows
num_columns (:obj:`int`):
Total number of table columns
max_length (:obj:`int`):
Total maximum length.
truncation_strategy (:obj:`str` or :obj:`~transformers.TapasTruncationStrategy`):
Truncation strategy to use. Seeing as this method should only be called when truncating, the only
available strategy is the :obj:`"drop_rows_to_fit"` strategy.
Returns:
:obj:`Tuple(int, int)`: tuple containing the number of rows after truncation, and the number of tokens
available for each table element.
"""
if not isinstance(truncation_strategy, TapasTruncationStrategy):
truncation_strategy = TapasTruncationStrategy(truncation_strategy)
if max_length is None:
max_length = self.model_max_length
if truncation_strategy == TapasTruncationStrategy.DROP_ROWS_TO_FIT:
while True:
num_tokens = self._get_max_num_tokens(
query_tokens, tokenized_table, num_rows=num_rows, num_columns=num_columns, max_length=max_length
)
if num_tokens is not None:
# We could fit the table.
break
# Try to drop a row to fit the table.
num_rows -= 1
if num_rows < 1:
break
elif truncation_strategy != TapasTruncationStrategy.DO_NOT_TRUNCATE:
raise ValueError(f"Unknown truncation strategy {truncation_strategy}.")
return num_rows, num_tokens or 1
def _tokenize_table(
self,
table=None,
):
"""
Tokenizes column headers and cell texts of a table.
Args:
table (:obj:`pd.Dataframe`):
Table. Returns: :obj:`TokenizedTable`: TokenizedTable object.
"""
tokenized_rows = []
tokenized_row = []
# tokenize column headers
for column in table:
if self.strip_column_names:
tokenized_row.append(self.tokenize(""))
else:
tokenized_row.append(self.tokenize(column))
tokenized_rows.append(tokenized_row)
# tokenize cell values
for idx, row in table.iterrows():
tokenized_row = []
for cell in row:
tokenized_row.append(self.tokenize(cell))
tokenized_rows.append(tokenized_row)
token_coordinates = []
for row_index, row in enumerate(tokenized_rows):
for column_index, cell in enumerate(row):
for token_index, _ in enumerate(cell):
token_coordinates.append(
TokenCoordinates(
row_index=row_index,
column_index=column_index,
token_index=token_index,
)
)
return TokenizedTable(
rows=tokenized_rows,
selected_tokens=token_coordinates,
)
def _question_encoding_cost(self, question_tokens):
# Two extra spots of SEP and CLS.
return len(question_tokens) + 2
def _get_token_budget(self, question_tokens, max_length=None):
"""
Computes the number of tokens left for the table after tokenizing a question, taking into account the max
sequence length of the model.
Args:
question_tokens (:obj:`List[String]`):
List of question tokens. Returns: :obj:`int`: the number of tokens left for the table, given the model
max length.
"""
return (max_length if max_length is not None else self.model_max_length) - self._question_encoding_cost(
question_tokens
)
def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]:
"""Iterates over partial table and returns token, column and row indexes."""
for tc in table.selected_tokens:
# First row is header row.
if tc.row_index >= num_rows + 1:
continue
if tc.column_index >= num_columns:
continue
cell = table.rows[tc.row_index][tc.column_index]
token = cell[tc.token_index]
word_begin_index = tc.token_index
# Don't add partial words. Find the starting word piece and check if it
# fits in the token budget.
while word_begin_index >= 0 and _is_inner_wordpiece(cell[word_begin_index]):
word_begin_index -= 1
if word_begin_index >= num_tokens:
continue
yield TableValue(token, tc.column_index + 1, tc.row_index)
def _get_table_boundaries(self, table):
"""Return maximal number of rows, columns and tokens."""
max_num_tokens = 0
max_num_columns = 0
max_num_rows = 0
for tc in table.selected_tokens:
max_num_columns = max(max_num_columns, tc.column_index + 1)
max_num_rows = max(max_num_rows, tc.row_index + 1)
max_num_tokens = max(max_num_tokens, tc.token_index + 1)
max_num_columns = min(self.max_column_id, max_num_columns)
max_num_rows = min(self.max_row_id, max_num_rows)
return max_num_rows, max_num_columns, max_num_tokens
def _get_table_cost(self, table, num_columns, num_rows, num_tokens):
return sum(1 for _ in self._get_table_values(table, num_columns, num_rows, num_tokens))
def _get_max_num_tokens(self, question_tokens, tokenized_table, num_columns, num_rows, max_length):
"""Computes max number of tokens that can be squeezed into the budget."""
token_budget = self._get_token_budget(question_tokens, max_length)
_, _, max_num_tokens = self._get_table_boundaries(tokenized_table)
if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
max_num_tokens = self.cell_trim_length
num_tokens = 0
for num_tokens in range(max_num_tokens + 1):
cost = self._get_table_cost(tokenized_table, num_columns, num_rows, num_tokens + 1)
if cost > token_budget:
break
if num_tokens < max_num_tokens:
if self.cell_trim_length >= 0:
# We don't allow dynamic trimming if a cell_trim_length is set.
return None
if num_tokens == 0:
return None
return num_tokens
def _get_num_columns(self, table):
num_columns = table.shape[1]
if num_columns >= self.max_column_id:
raise ValueError("Too many columns")
return num_columns
def _get_num_rows(self, table, drop_rows_to_fit):
num_rows = table.shape[0]
if num_rows >= self.max_row_id:
if drop_rows_to_fit:
num_rows = self.max_row_id - 1
else:
raise ValueError("Too many rows")
return num_rows
def _serialize_text(self, question_tokens):
"""Serializes texts in index arrays."""
tokens = []
segment_ids = []
column_ids = []
row_ids = []
# add [CLS] token at the beginning
tokens.append(self.cls_token)
segment_ids.append(0)
column_ids.append(0)
row_ids.append(0)
for token in question_tokens:
tokens.append(token)
segment_ids.append(0)
column_ids.append(0)
row_ids.append(0)
return tokens, segment_ids, column_ids, row_ids
def _serialize(
self,
question_tokens,
table,
num_columns,
num_rows,
num_tokens,
):
"""Serializes table and text."""
tokens, segment_ids, column_ids, row_ids = self._serialize_text(question_tokens)
# add [SEP] token between question and table tokens
tokens.append(self.sep_token)
segment_ids.append(0)
column_ids.append(0)
row_ids.append(0)
for token, column_id, row_id in self._get_table_values(table, num_columns, num_rows, num_tokens):
tokens.append(token)
segment_ids.append(1)
column_ids.append(column_id)
row_ids.append(row_id)
return SerializedExample(
tokens=tokens,
segment_ids=segment_ids,
column_ids=column_ids,
row_ids=row_ids,
)
def _get_column_values(self, table, col_index):
table_numeric_values = {}
for row_index, row in table.iterrows():
cell = row[col_index]
if cell.numeric_value is not None:
table_numeric_values[row_index] = cell.numeric_value
return table_numeric_values
def _get_cell_token_indexes(self, column_ids, row_ids, column_id, row_id):
for index in range(len(column_ids)):
if column_ids[index] - 1 == column_id and row_ids[index] - 1 == row_id:
yield index
def _get_numeric_column_ranks(self, column_ids, row_ids, table):
"""Returns column ranks for all numeric columns."""
ranks = [0] * len(column_ids)
inv_ranks = [0] * len(column_ids)
# original code from tf_example_utils.py of the original implementation
if table is not None:
for col_index in range(len(table.columns)):
table_numeric_values = self._get_column_values(table, col_index)
if not table_numeric_values:
continue
try:
key_fn = get_numeric_sort_key_fn(table_numeric_values.values())
except ValueError:
continue
table_numeric_values = {row_index: key_fn(value) for row_index, value in table_numeric_values.items()}
table_numeric_values_inv = collections.defaultdict(list)
for row_index, value in table_numeric_values.items():
table_numeric_values_inv[value].append(row_index)
unique_values = sorted(table_numeric_values_inv.keys())
for rank, value in enumerate(unique_values):
for row_index in table_numeric_values_inv[value]:
for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index):
ranks[index] = rank + 1
inv_ranks[index] = len(unique_values) - rank
return ranks, inv_ranks
def _get_numeric_sort_key_fn(self, table_numeric_values, value):
"""
Returns the sort key function for comparing value to table values. The function returned will be a suitable
input for the key param of the sort(). See number_annotation_utils._get_numeric_sort_key_fn for details
Args:
table_numeric_values: Numeric values of a column
value: Numeric value in the question
Returns:
A function key function to compare column and question values.
"""
if not table_numeric_values:
return None
all_values = list(table_numeric_values.values())
all_values.append(value)
try:
return get_numeric_sort_key_fn(all_values)
except ValueError:
return None
def _get_numeric_relations(self, question, column_ids, row_ids, table):
"""
Returns numeric relations embeddings
Args:
question: Question object.
column_ids: Maps word piece position to column id.
row_ids: Maps word piece position to row id.
table: The table containing the numeric cell values.
"""
numeric_relations = [0] * len(column_ids)
# first, we add any numeric value spans to the question:
# Create a dictionary that maps a table cell to the set of all relations
# this cell has with any value in the question.
cell_indices_to_relations = collections.defaultdict(set)
if question is not None and table is not None:
for numeric_value_span in question.numeric_spans:
for value in numeric_value_span.values:
for column_index in range(len(table.columns)):
table_numeric_values = self._get_column_values(table, column_index)
sort_key_fn = self._get_numeric_sort_key_fn(table_numeric_values, value)
if sort_key_fn is None:
continue
for row_index, cell_value in table_numeric_values.items():
relation = get_numeric_relation(value, cell_value, sort_key_fn)
if relation is not None:
cell_indices_to_relations[column_index, row_index].add(relation)
# For each cell add a special feature for all its word pieces.
for (column_index, row_index), relations in cell_indices_to_relations.items():
relation_set_index = 0
for relation in relations:
assert relation.value >= Relation.EQ.value
relation_set_index += 2 ** (relation.value - Relation.EQ.value)
for cell_token_index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index):
numeric_relations[cell_token_index] = relation_set_index
return numeric_relations
def _get_numeric_values(self, table, column_ids, row_ids):
"""Returns numeric values for computation of answer loss."""
numeric_values = [float("nan")] * len(column_ids)
if table is not None:
num_rows = table.shape[0]
num_columns = table.shape[1]
for col_index in range(num_columns):
for row_index in range(num_rows):
numeric_value = table.iloc[row_index, col_index].numeric_value
if numeric_value is not None:
if numeric_value.float_value is None:
continue
float_value = numeric_value.float_value
if float_value == float("inf"):
continue
for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index):
numeric_values[index] = float_value
return numeric_values
def _get_numeric_values_scale(self, table, column_ids, row_ids):
"""Returns a scale to each token to down weigh the value of long words."""
numeric_values_scale = [1.0] * len(column_ids)
if table is None:
return numeric_values_scale
num_rows = table.shape[0]
num_columns = table.shape[1]
for col_index in range(num_columns):
for row_index in range(num_rows):
indices = [index for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index)]
num_indices = len(indices)
if num_indices > 1:
for index in indices:
numeric_values_scale[index] = float(num_indices)
return numeric_values_scale
def _pad_to_seq_length(self, inputs):
while len(inputs) > self.model_max_length:
inputs.pop()
while len(inputs) < self.model_max_length:
inputs.append(0)
def _get_all_answer_ids_from_coordinates(
self,
column_ids,
row_ids,
answers_list,
):
"""Maps lists of answer coordinates to token indexes."""
answer_ids = [0] * len(column_ids)
found_answers = set()
all_answers = set()
for answers in answers_list:
column_index, row_index = answers
all_answers.add((column_index, row_index))
for index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index):
found_answers.add((column_index, row_index))
answer_ids[index] = 1
missing_count = len(all_answers) - len(found_answers)
return answer_ids, missing_count
def _get_all_answer_ids(self, column_ids, row_ids, answer_coordinates):
"""
Maps answer coordinates of a question to token indexes.
In the SQA format (TSV), the coordinates are given as (row, column) tuples. Here, we first swap them to
(column, row) format before calling _get_all_answer_ids_from_coordinates.
"""
def _to_coordinates(answer_coordinates_question):
return [(coords[1], coords[0]) for coords in answer_coordinates_question]
return self._get_all_answer_ids_from_coordinates(
column_ids, row_ids, answers_list=(_to_coordinates(answer_coordinates))
)
def _find_tokens(self, text, segment):
"""Return start index of segment in text or None."""
logging.info("text: %s %s", text, segment)
for index in range(1 + len(text) - len(segment)):
for seg_index, seg_token in enumerate(segment):
if text[index + seg_index].piece != seg_token.piece:
break
else:
return index
return None
def _find_answer_coordinates_from_answer_text(
self,
tokenized_table,
answer_text,
):
"""Returns all occurrences of answer_text in the table."""
logging.info("answer text: %s", answer_text)
for row_index, row in enumerate(tokenized_table.rows):
if row_index == 0:
# We don't search for answers in the header.
continue
for col_index, cell in enumerate(row):
token_index = self._find_tokens(cell, answer_text)
if token_index is not None:
yield TokenCoordinates(
row_index=row_index,
column_index=col_index,
token_index=token_index,
)
def _find_answer_ids_from_answer_texts(
self,
column_ids,
row_ids,
tokenized_table,
answer_texts,
):
"""Maps question with answer texts to the first matching token indexes."""
answer_ids = [0] * len(column_ids)
for answer_text in answer_texts:
for coordinates in self._find_answer_coordinates_from_answer_text(
tokenized_table,
answer_text,
):
# Maps answer coordinates to indexes this can fail if tokens / rows have
# been pruned.
indexes = list(
self._get_cell_token_indexes(
column_ids,
row_ids,
column_id=coordinates.column_index,
row_id=coordinates.row_index - 1,
)
)
indexes.sort()
coordinate_answer_ids = []
if indexes:
begin_index = coordinates.token_index + indexes[0]
end_index = begin_index + len(answer_text)
for index in indexes:
if index >= begin_index and index < end_index:
coordinate_answer_ids.append(index)
if len(coordinate_answer_ids) == len(answer_text):
for index in coordinate_answer_ids:
answer_ids[index] = 1
break
return answer_ids
def _get_answer_ids(self, column_ids, row_ids, answer_coordinates):
"""Maps answer coordinates of a question to token indexes."""
answer_ids, missing_count = self._get_all_answer_ids(column_ids, row_ids, answer_coordinates)
if missing_count:
raise ValueError("Couldn't find all answers")
return answer_ids
def get_answer_ids(self, column_ids, row_ids, tokenized_table, answer_texts_question, answer_coordinates_question):
if self.update_answer_coordinates:
return self._find_answer_ids_from_answer_texts(
column_ids,
row_ids,
tokenized_table,
answer_texts=[self.tokenize(at) for at in answer_texts_question],
)
return self._get_answer_ids(column_ids, row_ids, answer_coordinates_question)
def _pad(
self,
encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
max_length: Optional[int] = None,
padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
) -> dict:
"""
Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
Args:
encoded_inputs: Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
max_length: maximum length of the returned list and optionally padding length (see below).
Will truncate by taking into account the special tokens.
padding_strategy: PaddingStrategy to use for padding.
- PaddingStrategy.LONGEST Pad to the longest sequence in the batch
- PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
- PaddingStrategy.DO_NOT_PAD: Do not pad
The tokenizer padding sides are defined in self.padding_side:
- 'left': pads on the left of the sequences
- 'right': pads on the right of the sequences
pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)
"""
# Load from model defaults
if return_attention_mask is None:
return_attention_mask = "attention_mask" in self.model_input_names
if padding_strategy == PaddingStrategy.LONGEST:
max_length = len(encoded_inputs["input_ids"])
if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
needs_to_be_padded = (
padding_strategy != PaddingStrategy.DO_NOT_PAD and len(encoded_inputs["input_ids"]) != max_length
)
if needs_to_be_padded:
difference = max_length - len(encoded_inputs["input_ids"])
if self.padding_side == "right":
if return_attention_mask:
encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference
if "token_type_ids" in encoded_inputs:
encoded_inputs["token_type_ids"] = (
encoded_inputs["token_type_ids"] + [[self.pad_token_type_id] * 7] * difference
)
if "labels" in encoded_inputs:
encoded_inputs["labels"] = encoded_inputs["labels"] + [0] * difference
if "numeric_values" in encoded_inputs:
encoded_inputs["numeric_values"] = encoded_inputs["numeric_values"] + [float("nan")] * difference
if "numeric_values_scale" in encoded_inputs:
encoded_inputs["numeric_values_scale"] = (
encoded_inputs["numeric_values_scale"] + [1.0] * difference
)
if "special_tokens_mask" in encoded_inputs:
encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.pad_token_id] * difference
elif self.padding_side == "left":
if return_attention_mask:
encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"])
if "token_type_ids" in encoded_inputs:
encoded_inputs["token_type_ids"] = [[self.pad_token_type_id] * 7] * difference + encoded_inputs[
"token_type_ids"
]
if "labels" in encoded_inputs:
encoded_inputs["labels"] = [0] * difference + encoded_inputs["labels"]
if "numeric_values" in encoded_inputs:
encoded_inputs["numeric_values"] = [float("nan")] * difference + encoded_inputs["numeric_values"]
if "numeric_values_scale" in encoded_inputs:
encoded_inputs["numeric_values_scale"] = [1.0] * difference + encoded_inputs[
"numeric_values_scale"
]
if "special_tokens_mask" in encoded_inputs:
encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
encoded_inputs["input_ids"] = [self.pad_token_id] * difference + encoded_inputs["input_ids"]
else:
raise ValueError("Invalid padding strategy:" + str(self.padding_side))
else:
if return_attention_mask:
encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"])
return encoded_inputs
# Everything related to converting logits to predictions
def _get_cell_token_probs(self, probabilities, segment_ids, row_ids, column_ids):
for i, p in enumerate(probabilities):
segment_id = segment_ids[i]
col = column_ids[i] - 1
row = row_ids[i] - 1
if col >= 0 and row >= 0 and segment_id == 1:
yield i, p
def _get_mean_cell_probs(self, probabilities, segment_ids, row_ids, column_ids):
"""Computes average probability per cell, aggregating over tokens."""
coords_to_probs = collections.defaultdict(list)
for i, prob in self._get_cell_token_probs(probabilities, segment_ids, row_ids, column_ids):
col = column_ids[i] - 1
row = row_ids[i] - 1
coords_to_probs[(col, row)].append(prob)
return {coords: np.array(cell_probs).mean() for coords, cell_probs in coords_to_probs.items()}
def convert_logits_to_predictions(self, data, logits, logits_agg=None, cell_classification_threshold=0.5):
"""
Converts logits of :class:`~transformers.TapasForQuestionAnswering` to actual predicted answer coordinates and
optional aggregation indices.
The original implementation, on which this function is based, can be found `here
<https://github.com/google-research/tapas/blob/4908213eb4df7aa988573350278b44c4dbe3f71b/tapas/experiments/prediction_utils.py#L288>`__.
Args:
data (:obj:`dict`):
Dictionary mapping features to actual values. Should be created using
:class:`~transformers.TapasTokenizer`.
logits (:obj:`np.ndarray` of shape ``(batch_size, sequence_length)``):
Tensor containing the logits at the token level.
logits_agg (:obj:`np.ndarray` of shape ``(batch_size, num_aggregation_labels)``, `optional`):
Tensor containing the aggregation logits.
cell_classification_threshold (:obj:`float`, `optional`, defaults to 0.5):
Threshold to be used for cell selection. All table cells for which their probability is larger than
this threshold will be selected.
Returns:
:obj:`tuple` comprising various elements depending on the inputs: predicted_answer_coordinates
(``List[List[[tuple]]`` of length ``batch_size``): Predicted answer coordinates as a list of lists of
tuples. Each element in the list contains the predicted answer coordinates of a single example in the
batch, as a list of tuples. Each tuple is a cell, i.e. (row index, column index).
predicted_aggregation_indices (`optional`, returned when ``logits_aggregation`` is provided) ``List[int]``
of length ``batch_size``: Predicted aggregation operator indices of the aggregation head.
"""
# input data is of type float32
# np.log(np.finfo(np.float32).max) = 88.72284
# Any value over 88.72284 will overflow when passed through the exponential, sending a warning
# We disable this warning by truncating the logits.
logits[logits < -88.7] = -88.7
# Compute probabilities from token logits
probabilities = 1 / (1 + np.exp(-logits)) * data["attention_mask"]
token_types = [
"segment_ids",
"column_ids",
"row_ids",
"prev_labels",
"column_ranks",
"inv_column_ranks",
"numeric_relations",
]
# collect input_ids, segment ids, row ids and column ids of batch. Shape (batch_size, seq_len)
input_ids = data["input_ids"]
segment_ids = data["token_type_ids"][:, :, token_types.index("segment_ids")]
row_ids = data["token_type_ids"][:, :, token_types.index("row_ids")]
column_ids = data["token_type_ids"][:, :, token_types.index("column_ids")]
# next, get answer coordinates for every example in the batch
num_batch = input_ids.shape[0]
predicted_answer_coordinates = []
for i in range(num_batch):
probabilities_example = probabilities[i].tolist()
segment_ids_example = segment_ids[i]
row_ids_example = row_ids[i]
column_ids_example = column_ids[i]
max_width = column_ids_example.max()
max_height = row_ids_example.max()
if max_width == 0 and max_height == 0:
continue
cell_coords_to_prob = self._get_mean_cell_probs(
probabilities_example,
segment_ids_example.tolist(),
row_ids_example.tolist(),
column_ids_example.tolist(),
)
# Select the answers above the classification threshold.
answer_coordinates = []
for col in range(max_width):
for row in range(max_height):
cell_prob = cell_coords_to_prob.get((col, row), None)
if cell_prob is not None:
if cell_prob > cell_classification_threshold:
answer_coordinates.append((row, col))
answer_coordinates = sorted(answer_coordinates)
predicted_answer_coordinates.append(answer_coordinates)
output = predicted_answer_coordinates
if logits_agg is not None:
predicted_aggregation_indices = logits_agg.argmax(dim=-1)
output = (output, predicted_aggregation_indices.tolist())
return output
# End of everything related to converting logits to predictions
# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer
class BasicTokenizer(object):
"""
Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
Args:
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
"""
def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
if never_split is None:
never_split = []
self.do_lower_case = do_lower_case
self.never_split = set(never_split)
self.tokenize_chinese_chars = tokenize_chinese_chars
self.strip_accents = strip_accents
def tokenize(self, text, never_split=None):
"""
Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
WordPieceTokenizer.
Args:
**never_split**: (`optional`) list of str
Kept for backward compatibility purposes. Now implemented directly at the base class level (see
:func:`PreTrainedTokenizer.tokenize`) List of token not to split.
"""
# union() returns a new set by concatenating the two sets.
never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
if self.tokenize_chinese_chars:
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if token not in never_split:
if self.do_lower_case:
token = token.lower()
if self.strip_accents is not False:
token = self._run_strip_accents(token)
elif self.strip_accents:
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token, never_split))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text, never_split=None):
"""Splits punctuation on a piece of text."""
if never_split is not None and text in never_split:
return [text]
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like the all of the other languages.
if (
(cp >= 0x4E00 and cp <= 0x9FFF)
or (cp >= 0x3400 and cp <= 0x4DBF) #
or (cp >= 0x20000 and cp <= 0x2A6DF) #
or (cp >= 0x2A700 and cp <= 0x2B73F) #
or (cp >= 0x2B740 and cp <= 0x2B81F) #
or (cp >= 0x2B820 and cp <= 0x2CEAF) #
or (cp >= 0xF900 and cp <= 0xFAFF)
or (cp >= 0x2F800 and cp <= 0x2FA1F) #
): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xFFFD or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
# Copied from transformers.models.bert.tokenization_bert.WordpieceTokenizer
class WordpieceTokenizer(object):
"""Runs WordPiece tokenization."""
def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
tokenization using the given vocabulary.
For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
# Below: utilities for TAPAS tokenizer (independent from PyTorch/Tensorflow).
# This includes functions to parse numeric values (dates and numbers) from both the table and questions in order
# to create the column_ranks, inv_column_ranks, numeric_values, numeric values_scale and numeric_relations in
# prepare_for_model of TapasTokenizer.
# These are meant to be used in an academic setup, for production use cases Gold mine or Aqua should be used.
# taken from constants.py of the original implementation
# URL: https://github.com/google-research/tapas/blob/master/tapas/utils/constants.py
class Relation(enum.Enum):
HEADER_TO_CELL = 1 # Connects header to cell.
CELL_TO_HEADER = 2 # Connects cell to header.
QUERY_TO_HEADER = 3 # Connects query to headers.
QUERY_TO_CELL = 4 # Connects query to cells.
ROW_TO_CELL = 5 # Connects row to cells.
CELL_TO_ROW = 6 # Connects cells to row.
EQ = 7 # Annotation value is same as cell value
LT = 8 # Annotation value is less than cell value
GT = 9 # Annotation value is greater than cell value
@dataclass
class Date:
year: Optional[int] = None
month: Optional[int] = None
day: Optional[int] = None
@dataclass
class NumericValue:
float_value: Optional[float] = None
date: Optional[Date] = None
@dataclass
class NumericValueSpan:
begin_index: int = None
end_index: int = None
values: List[NumericValue] = None
@dataclass
class Cell:
text: Text
numeric_value: Optional[NumericValue] = None
@dataclass
class Question:
original_text: Text # The original raw question string.
text: Text # The question string after normalization.
numeric_spans: Optional[List[NumericValueSpan]] = None
# Below: all functions from number_utils.py as well as 2 functions (namely get_all_spans and normalize_for_match)
# from text_utils.py of the original implementation. URL's:
# - https://github.com/google-research/tapas/blob/master/tapas/utils/number_utils.py
# - https://github.com/google-research/tapas/blob/master/tapas/utils/text_utils.py
# Constants for parsing date expressions.
# Masks that specify (by a bool) which of (year, month, day) will be populated.
_DateMask = collections.namedtuple("_DateMask", ["year", "month", "day"])
_YEAR = _DateMask(True, False, False)
_YEAR_MONTH = _DateMask(True, True, False)
_YEAR_MONTH_DAY = _DateMask(True, True, True)
_MONTH = _DateMask(False, True, False)
_MONTH_DAY = _DateMask(False, True, True)
# Pairs of patterns to pass to 'datetime.strptime' and masks specifying which
# fields will be set by the corresponding pattern.
_DATE_PATTERNS = (
("%B", _MONTH),
("%Y", _YEAR),
("%Ys", _YEAR),
("%b %Y", _YEAR_MONTH),
("%B %Y", _YEAR_MONTH),
("%B %d", _MONTH_DAY),
("%b %d", _MONTH_DAY),
("%d %b", _MONTH_DAY),
("%d %B", _MONTH_DAY),
("%B %d, %Y", _YEAR_MONTH_DAY),
("%d %B %Y", _YEAR_MONTH_DAY),
("%m-%d-%Y", _YEAR_MONTH_DAY),
("%Y-%m-%d", _YEAR_MONTH_DAY),
("%Y-%m", _YEAR_MONTH),
("%B %Y", _YEAR_MONTH),
("%d %b %Y", _YEAR_MONTH_DAY),
("%Y-%m-%d", _YEAR_MONTH_DAY),
("%b %d, %Y", _YEAR_MONTH_DAY),
("%d.%m.%Y", _YEAR_MONTH_DAY),
("%A, %b %d", _MONTH_DAY),
("%A, %B %d", _MONTH_DAY),
)
# This mapping is used to convert date patterns to regex patterns.
_FIELD_TO_REGEX = (
("%A", r"\w+"), # Weekday as locale’s full name.
("%B", r"\w+"), # Month as locale’s full name.
("%Y", r"\d{4}"), # Year with century as a decimal number.
("%b", r"\w{3}"), # Month as locale’s abbreviated name.
("%d", r"\d{1,2}"), # Day of the month as a zero-padded decimal number.
("%m", r"\d{1,2}"), # Month as a zero-padded decimal number.
)
def _process_date_pattern(dp):
"""Compute a regex for each date pattern to use as a prefilter."""
pattern, mask = dp
regex = pattern
regex = regex.replace(".", re.escape("."))
regex = regex.replace("-", re.escape("-"))
regex = regex.replace(" ", r"\s+")
for field, field_regex in _FIELD_TO_REGEX:
regex = regex.replace(field, field_regex)
# Make sure we didn't miss any of the fields.
assert "%" not in regex, regex
return pattern, mask, re.compile("^" + regex + "$")
def _process_date_patterns():
return tuple(_process_date_pattern(dp) for dp in _DATE_PATTERNS)
_PROCESSED_DATE_PATTERNS = _process_date_patterns()
_MAX_DATE_NGRAM_SIZE = 5
# Following DynSp:
# https://github.com/Microsoft/DynSP/blob/master/util.py#L414.
_NUMBER_WORDS = [
"zero",
"one",
"two",
"three",
"four",
"five",
"six",
"seven",
"eight",
"nine",
"ten",
"eleven",
"twelve",
]
_ORDINAL_WORDS = [
"zeroth",
"first",
"second",
"third",
"fourth",
"fith",
"sixth",
"seventh",
"eighth",
"ninth",
"tenth",
"eleventh",
"twelfth",
]
_ORDINAL_SUFFIXES = ["st", "nd", "rd", "th"]
_NUMBER_PATTERN = re.compile(r"((^|\s)[+-])?((\.\d+)|(\d+(,\d\d\d)*(\.\d*)?))")
# Following DynSp:
# https://github.com/Microsoft/DynSP/blob/master/util.py#L293.
_MIN_YEAR = 1700
_MAX_YEAR = 2016
_INF = float("INF")
def _get_numeric_value_from_date(date, mask):
"""Converts date (datetime Python object) to a NumericValue object with a Date object value."""
if date.year < _MIN_YEAR or date.year > _MAX_YEAR:
raise ValueError("Invalid year: %d" % date.year)
new_date = Date()
if mask.year:
new_date.year = date.year
if mask.month:
new_date.month = date.month
if mask.day:
new_date.day = date.day
return NumericValue(date=new_date)
def _get_span_length_key(span):
"""Sorts span by decreasing length first and incresing first index second."""
return span[1] - span[0], -span[0]
def _get_numeric_value_from_float(value):
"""Converts float (Python) to a NumericValue object with a float value."""
return NumericValue(float_value=value)
# Doesn't parse ordinal expressions such as '18th of february 1655'.
def _parse_date(text):
"""Attempts to format a text as a standard date string (yyyy-mm-dd)."""
text = re.sub(r"Sept\b", "Sep", text)
for in_pattern, mask, regex in _PROCESSED_DATE_PATTERNS:
if not regex.match(text):
continue
try:
date = datetime.datetime.strptime(text, in_pattern).date()
except ValueError:
continue
try:
return _get_numeric_value_from_date(date, mask)
except ValueError:
continue
return None
def _parse_number(text):
"""Parses simple cardinal and ordinals numbers."""
for suffix in _ORDINAL_SUFFIXES:
if text.endswith(suffix):
text = text[: -len(suffix)]
break
text = text.replace(",", "")
try:
value = float(text)
except ValueError:
return None
if math.isnan(value):
return None
if value == _INF:
return None
return value
def get_all_spans(text, max_ngram_length):
"""
Split a text into all possible ngrams up to 'max_ngram_length'. Split points are white space and punctuation.
Args:
text: Text to split.
max_ngram_length: maximal ngram length.
Yields:
Spans, tuples of begin-end index.
"""
start_indexes = []
for index, char in enumerate(text):
if not char.isalnum():
continue
if index == 0 or not text[index - 1].isalnum():
start_indexes.append(index)
if index + 1 == len(text) or not text[index + 1].isalnum():
for start_index in start_indexes[-max_ngram_length:]:
yield start_index, index + 1
def normalize_for_match(text):
return " ".join(text.lower().split())
def format_text(text):
"""Lowercases and strips punctuation."""
text = text.lower().strip()
if text == "n/a" or text == "?" or text == "nan":
text = EMPTY_TEXT
text = re.sub(r"[^\w\d]+", " ", text).replace("_", " ")
text = " ".join(text.split())
text = text.strip()
if text:
return text
return EMPTY_TEXT
def parse_text(text):
"""
Extracts longest number and date spans.
Args:
text: text to annotate
Returns:
List of longest numeric value spans.
"""
span_dict = collections.defaultdict(list)
for match in _NUMBER_PATTERN.finditer(text):
span_text = text[match.start() : match.end()]
number = _parse_number(span_text)
if number is not None:
span_dict[match.span()].append(_get_numeric_value_from_float(number))
for begin_index, end_index in get_all_spans(text, max_ngram_length=1):
if (begin_index, end_index) in span_dict:
continue
span_text = text[begin_index:end_index]
number = _parse_number(span_text)
if number is not None:
span_dict[begin_index, end_index].append(_get_numeric_value_from_float(number))
for number, word in enumerate(_NUMBER_WORDS):
if span_text == word:
span_dict[begin_index, end_index].append(_get_numeric_value_from_float(float(number)))
break
for number, word in enumerate(_ORDINAL_WORDS):
if span_text == word:
span_dict[begin_index, end_index].append(_get_numeric_value_from_float(float(number)))
break
for begin_index, end_index in get_all_spans(text, max_ngram_length=_MAX_DATE_NGRAM_SIZE):
span_text = text[begin_index:end_index]
date = _parse_date(span_text)
if date is not None:
span_dict[begin_index, end_index].append(date)
spans = sorted(span_dict.items(), key=lambda span_value: _get_span_length_key(span_value[0]), reverse=True)
selected_spans = []
for span, value in spans:
for selected_span, _ in selected_spans:
if selected_span[0] <= span[0] and span[1] <= selected_span[1]:
break
else:
selected_spans.append((span, value))
selected_spans.sort(key=lambda span_value: span_value[0][0])
numeric_value_spans = []
for span, values in selected_spans:
numeric_value_spans.append(NumericValueSpan(begin_index=span[0], end_index=span[1], values=values))
return numeric_value_spans
# Below: all functions from number_annotation_utils.py and 2 functions (namely filter_invalid_unicode
# and filter_invalid_unicode_from_table) from text_utils.py of the original implementation. URL's:
# - https://github.com/google-research/tapas/blob/master/tapas/utils/number_annotation_utils.py
# - https://github.com/google-research/tapas/blob/master/tapas/utils/text_utils.py
_PrimitiveNumericValue = Union[float, Tuple[Optional[float], Optional[float], Optional[float]]]
_SortKeyFn = Callable[[NumericValue], Tuple[float, Ellipsis]]
_DATE_TUPLE_SIZE = 3
EMPTY_TEXT = "EMPTY"
NUMBER_TYPE = "number"
DATE_TYPE = "date"
def _get_value_type(numeric_value):
if numeric_value.float_value is not None:
return NUMBER_TYPE
elif numeric_value.date is not None:
return DATE_TYPE
raise ValueError("Unknown type: %s" % numeric_value)
def _get_value_as_primitive_value(numeric_value):
"""Maps a NumericValue proto to a float or tuple of float."""
if numeric_value.float_value is not None:
return numeric_value.float_value
if numeric_value.date is not None:
date = numeric_value.date
value_tuple = [None, None, None]
# All dates fields are cased to float to produce a simple primitive value.
if date.year is not None:
value_tuple[0] = float(date.year)
if date.month is not None:
value_tuple[1] = float(date.month)
if date.day is not None:
value_tuple[2] = float(date.day)
return tuple(value_tuple)
raise ValueError("Unknown type: %s" % numeric_value)
def _get_all_types(numeric_values):
return {_get_value_type(value) for value in numeric_values}
def get_numeric_sort_key_fn(numeric_values):
"""
Creates a function that can be used as a sort key or to compare the values. Maps to primitive types and finds the
biggest common subset. Consider the values "05/05/2010" and "August 2007". With the corresponding primitive values
(2010.,5.,5.) and (2007.,8., None). These values can be compared by year and date so we map to the sequence (2010.,
5.), (2007., 8.). If we added a third value "2006" with primitive value (2006., None, None), we could only compare
by the year so we would map to (2010.,), (2007.,) and (2006.,).
Args:
numeric_values: Values to compare
Returns:
A function that can be used as a sort key function (mapping numeric values to a comparable tuple)
Raises:
ValueError if values don't have a common type or are not comparable.
"""
value_types = _get_all_types(numeric_values)
if len(value_types) != 1:
raise ValueError("No common value type in %s" % numeric_values)
value_type = next(iter(value_types))
if value_type == NUMBER_TYPE:
# Primitive values are simple floats, nothing to do here.
return _get_value_as_primitive_value
# The type can only be Date at this point which means the primitive type
# is a float triple.
valid_indexes = set(range(_DATE_TUPLE_SIZE))
for numeric_value in numeric_values:
value = _get_value_as_primitive_value(numeric_value)
assert isinstance(value, tuple)
for tuple_index, inner_value in enumerate(value):
if inner_value is None:
valid_indexes.discard(tuple_index)
if not valid_indexes:
raise ValueError("No common value in %s" % numeric_values)
def _sort_key_fn(numeric_value):
value = _get_value_as_primitive_value(numeric_value)
return tuple(value[index] for index in valid_indexes)
return _sort_key_fn
def _consolidate_numeric_values(row_index_to_values, min_consolidation_fraction, debug_info):
"""
Finds the most common numeric values in a column and returns them
Args:
row_index_to_values:
For each row index all the values in that cell.
min_consolidation_fraction:
Fraction of cells that need to have consolidated value.
debug_info:
Additional information only used for logging
Returns:
For each row index the first value that matches the most common value. Rows that don't have a matching value
are dropped. Empty list if values can't be consolidated.
"""
type_counts = collections.Counter()
for numeric_values in row_index_to_values.values():
type_counts.update(_get_all_types(numeric_values))
if not type_counts:
return {}
max_count = max(type_counts.values())
if max_count < len(row_index_to_values) * min_consolidation_fraction:
# logging.log_every_n(logging.INFO, 'Can\'t consolidate types: %s %s %d', 100,
# debug_info, row_index_to_values, max_count)
return {}
valid_types = set()
for value_type, count in type_counts.items():
if count == max_count:
valid_types.add(value_type)
if len(valid_types) > 1:
assert DATE_TYPE in valid_types
max_type = DATE_TYPE
else:
max_type = next(iter(valid_types))
new_row_index_to_value = {}
for index, values in row_index_to_values.items():
# Extract the first matching value.
for value in values:
if _get_value_type(value) == max_type:
new_row_index_to_value[index] = value
break
return new_row_index_to_value
def _get_numeric_values(text):
"""Parses text and returns numeric values."""
numeric_spans = parse_text(text)
return itertools.chain(*(span.values for span in numeric_spans))
def _get_column_values(table, col_index):
"""
Parses text in column and returns a dict mapping row_index to values. This is the _get_column_values function from
number_annotation_utils.py of the original implementation
Args:
table: Pandas dataframe
col_index: integer, indicating the index of the column to get the numeric values of
"""
index_to_values = {}
for row_index, row in table.iterrows():
text = normalize_for_match(row[col_index].text)
index_to_values[row_index] = list(_get_numeric_values(text))
return index_to_values
def get_numeric_relation(value, other_value, sort_key_fn):
"""Compares two values and returns their relation or None."""
value = sort_key_fn(value)
other_value = sort_key_fn(other_value)
if value == other_value:
return Relation.EQ
if value < other_value:
return Relation.LT
if value > other_value:
return Relation.GT
return None
def add_numeric_values_to_question(question):
"""Adds numeric value spans to a question."""
original_text = question
question = normalize_for_match(question)
numeric_spans = parse_text(question)
return Question(original_text=original_text, text=question, numeric_spans=numeric_spans)
def filter_invalid_unicode(text):
"""Return an empty string and True if 'text' is in invalid unicode."""
return ("", True) if isinstance(text, bytes) else (text, False)
def filter_invalid_unicode_from_table(table):
"""
Removes invalid unicode from table. Checks whether a table cell text contains an invalid unicode encoding. If yes,
reset the table cell text to an empty str and log a warning for each invalid cell
Args:
table: table to clean.
"""
# to do: add table id support
if not hasattr(table, "table_id"):
table.table_id = 0
for row_index, row in table.iterrows():
for col_index, cell in enumerate(row):
cell, is_invalid = filter_invalid_unicode(cell)
if is_invalid:
logging.warning(
"Scrub an invalid table body @ table_id: %s, row_index: %d, " "col_index: %d",
table.table_id,
row_index,
col_index,
)
for col_index, column in enumerate(table.columns):
column, is_invalid = filter_invalid_unicode(column)
if is_invalid:
logging.warning("Scrub an invalid table header @ table_id: %s, col_index: %d", table.table_id, col_index)
def add_numeric_table_values(table, min_consolidation_fraction=0.7, debug_info=None):
"""
Parses text in table column-wise and adds the consolidated values. Consolidation refers to finding values with a
common types (date or number)
Args:
table:
Table to annotate.
min_consolidation_fraction:
Fraction of cells in a column that need to have consolidated value.
debug_info:
Additional information used for logging.
"""
table = table.copy()
# First, filter table on invalid unicode
filter_invalid_unicode_from_table(table)
# Second, replace cell values by Cell objects
for row_index, row in table.iterrows():
for col_index, cell in enumerate(row):
table.iloc[row_index, col_index] = Cell(text=cell)
# Third, add numeric_value attributes to these Cell objects
for col_index, column in enumerate(table.columns):
column_values = _consolidate_numeric_values(
_get_column_values(table, col_index),
min_consolidation_fraction=min_consolidation_fraction,
debug_info=(debug_info, column),
)
for row_index, numeric_value in column_values.items():
table.iloc[row_index, col_index].numeric_value = numeric_value
return table
......@@ -28,6 +28,8 @@ from .file_utils import (
_datasets_available,
_faiss_available,
_flax_available,
_pandas_available,
_scatter_available,
_sentencepiece_available,
_tf_available,
_tokenizers_available,
......@@ -221,6 +223,27 @@ def require_tokenizers(test_case):
return test_case
def require_pandas(test_case):
"""
Decorator marking a test that requires pandas. These tests are skipped when pandas isn't installed.
"""
if not _pandas_available:
return unittest.skip("test requires pandas")(test_case)
else:
return test_case
def require_scatter(test_case):
"""
Decorator marking a test that requires PyTorch Scatter. These tests are skipped when PyTorch Scatter isn't
installed.
"""
if not _scatter_available:
return unittest.skip("test requires PyTorch Scatter")(test_case)
else:
return test_case
def require_torch_multi_gpu(test_case):
"""
Decorator marking a test that requires a multi-GPU setup (in PyTorch).
......
......@@ -1867,6 +1867,45 @@ def load_tf_weights_in_t5(*args, **kwargs):
requires_pytorch(load_tf_weights_in_t5)
TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST = None
class TapasForMaskedLM:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class TapasForQuestionAnswering:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class TapasForSequenceClassification:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class TapasModel:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = None
......
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import unittest
import numpy as np
import pandas as pd
from transformers import (
MODEL_FOR_CAUSAL_LM_MAPPING,
MODEL_FOR_MASKED_LM_MAPPING,
MODEL_FOR_MULTIPLE_CHOICE_MAPPING,
MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING,
MODEL_FOR_QUESTION_ANSWERING_MAPPING,
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
is_torch_available,
)
from transformers.file_utils import cached_property
from transformers.testing_utils import require_scatter, require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
if is_torch_available():
import torch
from transformers import (
TapasConfig,
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
TapasModel,
TapasTokenizer,
)
from transformers.models.tapas.modeling_tapas import (
IndexMap,
ProductIndexMap,
flatten,
gather,
range_index_map,
reduce_max,
reduce_mean,
reduce_sum,
)
class TapasModelTester:
"""You can also import this e.g from .test_modeling_tapas import TapasModelTester """
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_input_mask=True,
use_token_type_ids=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
max_position_embeddings=512,
type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10],
type_sequence_label_size=2,
positive_weight=10.0,
num_aggregation_labels=4,
num_labels=2,
aggregation_loss_importance=0.8,
use_answer_as_supervision=True,
answer_loss_importance=0.001,
use_normalized_answer_loss=False,
huber_loss_delta=25.0,
temperature=1.0,
agg_temperature=1.0,
use_gumbel_for_cells=False,
use_gumbel_for_agg=False,
average_approximation_function="ratio",
cell_selection_preference=0.5,
answer_loss_cutoff=100,
max_num_rows=64,
max_num_columns=32,
average_logits_per_cell=True,
select_one_column=True,
allow_empty_column_selection=False,
init_cell_selection_weights_to_zero=False,
reset_position_index_per_cell=True,
disable_per_token_loss=False,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.max_position_embeddings = max_position_embeddings
self.type_vocab_sizes = type_vocab_sizes
self.type_sequence_label_size = type_sequence_label_size
self.positive_weight = positive_weight
self.num_aggregation_labels = num_aggregation_labels
self.num_labels = num_labels
self.aggregation_loss_importance = aggregation_loss_importance
self.use_answer_as_supervision = use_answer_as_supervision
self.answer_loss_importance = answer_loss_importance
self.use_normalized_answer_loss = use_normalized_answer_loss
self.huber_loss_delta = huber_loss_delta
self.temperature = temperature
self.agg_temperature = agg_temperature
self.use_gumbel_for_cells = use_gumbel_for_cells
self.use_gumbel_for_agg = use_gumbel_for_agg
self.average_approximation_function = average_approximation_function
self.cell_selection_preference = cell_selection_preference
self.answer_loss_cutoff = answer_loss_cutoff
self.max_num_rows = max_num_rows
self.max_num_columns = max_num_columns
self.average_logits_per_cell = average_logits_per_cell
self.select_one_column = select_one_column
self.allow_empty_column_selection = allow_empty_column_selection
self.init_cell_selection_weights_to_zero = init_cell_selection_weights_to_zero
self.reset_position_index_per_cell = reset_position_index_per_cell
self.disable_per_token_loss = disable_per_token_loss
self.scope = scope
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).to(torch_device)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length]).to(torch_device)
token_type_ids = []
for type_vocab_size in self.type_vocab_sizes:
token_type_ids.append(ids_tensor(shape=[self.batch_size, self.seq_length], vocab_size=type_vocab_size))
token_type_ids = torch.stack(token_type_ids, dim=2).to(torch_device)
sequence_labels = None
token_labels = None
labels = None
numeric_values = None
numeric_values_scale = None
float_answer = None
aggregation_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size).to(torch_device)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels).to(torch_device)
labels = ids_tensor([self.batch_size, self.seq_length], vocab_size=2).to(torch_device)
numeric_values = floats_tensor([self.batch_size, self.seq_length]).to(torch_device)
numeric_values_scale = floats_tensor([self.batch_size, self.seq_length]).to(torch_device)
float_answer = floats_tensor([self.batch_size]).to(torch_device)
aggregation_labels = ids_tensor([self.batch_size], self.num_aggregation_labels).to(torch_device)
config = TapasConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
hidden_act=self.hidden_act,
hidden_dropout_prob=self.hidden_dropout_prob,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
type_vocab_sizes=self.type_vocab_sizes,
initializer_range=self.initializer_range,
positive_weight=self.positive_weight,
num_aggregation_labels=self.num_aggregation_labels,
num_labels=self.num_labels,
aggregation_loss_importance=self.aggregation_loss_importance,
use_answer_as_supervision=self.use_answer_as_supervision,
answer_loss_importance=self.answer_loss_importance,
use_normalized_answer_loss=self.use_normalized_answer_loss,
huber_loss_delta=self.huber_loss_delta,
temperature=self.temperature,
agg_temperature=self.agg_temperature,
use_gumbel_for_cells=self.use_gumbel_for_cells,
use_gumbel_for_agg=self.use_gumbel_for_agg,
average_approximation_function=self.average_approximation_function,
cell_selection_preference=self.cell_selection_preference,
answer_loss_cutoff=self.answer_loss_cutoff,
max_num_rows=self.max_num_rows,
max_num_columns=self.max_num_columns,
average_logits_per_cell=self.average_logits_per_cell,
select_one_column=self.select_one_column,
allow_empty_column_selection=self.allow_empty_column_selection,
init_cell_selection_weights_to_zero=self.init_cell_selection_weights_to_zero,
reset_position_index_per_cell=self.reset_position_index_per_cell,
disable_per_token_loss=self.disable_per_token_loss,
)
return (
config,
input_ids,
input_mask,
token_type_ids,
sequence_labels,
token_labels,
labels,
numeric_values,
numeric_values_scale,
float_answer,
aggregation_labels,
)
def create_and_check_model(
self,
config,
input_ids,
input_mask,
token_type_ids,
sequence_labels,
token_labels,
labels,
numeric_values,
numeric_values_scale,
float_answer,
aggregation_labels,
):
model = TapasModel(config=config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
result = model(input_ids, token_type_ids=token_type_ids)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
def create_and_check_for_masked_lm(
self,
config,
input_ids,
input_mask,
token_type_ids,
sequence_labels,
token_labels,
labels,
numeric_values,
numeric_values_scale,
float_answer,
aggregation_labels,
):
model = TapasForMaskedLM(config=config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
def create_and_check_for_question_answering(
self,
config,
input_ids,
input_mask,
token_type_ids,
sequence_labels,
token_labels,
labels,
numeric_values,
numeric_values_scale,
float_answer,
aggregation_labels,
):
# inference: without aggregation head (SQA). Model only returns logits
sqa_config = copy.copy(config)
sqa_config.num_aggregation_labels = 0
sqa_config.use_answer_as_supervision = False
model = TapasForQuestionAnswering(config=sqa_config)
model.to(torch_device)
model.eval()
result = model(
input_ids=input_ids,
attention_mask=input_mask,
token_type_ids=token_type_ids,
)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length))
# inference: with aggregation head (WTQ, WikiSQL-supervised). Model returns logits and aggregation logits
model = TapasForQuestionAnswering(config=config)
model.to(torch_device)
model.eval()
result = model(
input_ids=input_ids,
attention_mask=input_mask,
token_type_ids=token_type_ids,
)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length))
self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels))
# training: can happen in 3 main ways
# case 1: conversational (SQA)
model = TapasForQuestionAnswering(config=sqa_config)
model.to(torch_device)
model.eval()
result = model(
input_ids,
attention_mask=input_mask,
token_type_ids=token_type_ids,
labels=labels,
)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length))
# case 2: weak supervision for aggregation (WTQ)
model = TapasForQuestionAnswering(config=config)
model.to(torch_device)
model.eval()
result = model(
input_ids=input_ids,
attention_mask=input_mask,
token_type_ids=token_type_ids,
labels=labels,
numeric_values=numeric_values,
numeric_values_scale=numeric_values_scale,
float_answer=float_answer,
)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length))
self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels))
# case 3: strong supervision for aggregation (WikiSQL-supervised)
wikisql_config = copy.copy(config)
wikisql_config.use_answer_as_supervision = False
model = TapasForQuestionAnswering(config=wikisql_config)
model.to(torch_device)
model.eval()
result = model(
input_ids,
attention_mask=input_mask,
token_type_ids=token_type_ids,
labels=labels,
aggregation_labels=aggregation_labels,
)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length))
self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels))
def create_and_check_for_sequence_classification(
self,
config,
input_ids,
input_mask,
token_type_ids,
sequence_labels,
token_labels,
labels,
numeric_values,
numeric_values_scale,
float_answer,
aggregation_labels,
):
config.num_labels = self.num_labels
model = TapasForSequenceClassification(config)
model.to(torch_device)
model.eval()
result = model(input_ids, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
input_mask,
token_type_ids,
sequence_labels,
token_labels,
labels,
numeric_values,
numeric_values_scale,
float_answer,
aggregation_labels,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
@require_scatter
class TapasModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(
TapasModel,
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
)
if is_torch_available()
else None
)
test_pruning = False
test_torchscript = False
test_resize_embeddings = True
test_head_masking = False
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = copy.deepcopy(inputs_dict)
if model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.values():
inputs_dict = {
k: v.unsqueeze(1).expand(-1, self.model_tester.num_choices, -1).contiguous()
if isinstance(v, torch.Tensor) and v.ndim > 1
else v
for k, v in inputs_dict.items()
}
if return_labels:
if model_class in MODEL_FOR_MULTIPLE_CHOICE_MAPPING.values():
inputs_dict["labels"] = torch.ones(self.model_tester.batch_size, dtype=torch.long, device=torch_device)
elif model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.values():
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
)
inputs_dict["aggregation_labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
inputs_dict["numeric_values"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length),
dtype=torch.float,
device=torch_device,
)
inputs_dict["numeric_values_scale"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length),
dtype=torch.float,
device=torch_device,
)
inputs_dict["float_answer"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.float, device=torch_device
)
elif model_class in [
*MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.values(),
*MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING.values(),
]:
inputs_dict["labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
elif model_class in [
*MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.values(),
*MODEL_FOR_CAUSAL_LM_MAPPING.values(),
*MODEL_FOR_MASKED_LM_MAPPING.values(),
*MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING.values(),
]:
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
)
return inputs_dict
def setUp(self):
self.model_tester = TapasModelTester(self)
self.config_tester = ConfigTester(self, config_class=TapasConfig, dim=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)
def test_for_question_answering(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_for_question_answering(*config_and_inputs)
def test_for_sequence_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_for_sequence_classification(*config_and_inputs)
def prepare_tapas_single_inputs_for_inference():
# Here we prepare a single table-question pair to test TAPAS inference on:
data = {
"Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
"Age": ["33", "35"],
}
queries = "Which footballer is 33 years old?"
table = pd.DataFrame.from_dict(data)
return table, queries
def prepare_tapas_batch_inputs_for_inference():
# Here we prepare a batch of 2 table-question pairs to test TAPAS inference on:
data = {
"Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
"Age": ["33", "35"],
"Number of goals": ["712", "750"],
}
queries = ["Which footballer is 33 years old?", "How many goals does Ronaldo have?"]
table = pd.DataFrame.from_dict(data)
return table, queries
def prepare_tapas_batch_inputs_for_training():
# Here we prepare a DIFFERENT batch of 2 table-question pairs to test TAPAS training on:
data = {
"Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
"Age": ["33", "35"],
"Number of goals": ["712", "750"],
}
queries = ["Which footballer is 33 years old?", "What's the total number of goals?"]
table = pd.DataFrame.from_dict(data)
answer_coordinates = [[(0, 0)], [(0, 2), (1, 2)]]
answer_text = [["Lionel Messi"], ["1462"]]
float_answer = [float("NaN"), float("1462")]
return table, queries, answer_coordinates, answer_text, float_answer
TOLERANCE = 1
@require_torch
@require_scatter
class TapasModelIntegrationTest(unittest.TestCase):
@cached_property
def default_tokenizer(self):
return TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")
@slow
def test_inference_no_head(self):
# ideally we want to test this with the weights of tapas_inter_masklm_base_reset,
# but since it's not straightforward to do this with the TF 1 implementation, we test it with
# the weights of the WTQ base model (i.e. tapas_wtq_wikisql_sqa_inter_masklm_base_reset)
model = TapasModel.from_pretrained("google/tapas-base-finetuned-wtq").to(torch_device)
tokenizer = self.default_tokenizer
table, queries = prepare_tapas_single_inputs_for_inference()
inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
inputs = {k: v.to(torch_device) for k, v in inputs.items()}
outputs = model(**inputs)
# test the sequence output
expected_slice = torch.tensor(
[
[
[-0.141581565, -0.599805772, 0.747186482],
[-0.143664181, -0.602008104, 0.749218345],
[-0.15169853, -0.603363097, 0.741370678],
]
],
device=torch_device,
)
self.assertTrue(torch.allclose(outputs.last_hidden_state[:, :3, :3], expected_slice, atol=TOLERANCE))
# test the pooled output
expected_slice = torch.tensor([[0.987518311, -0.970520139, -0.994303405]], device=torch_device)
self.assertTrue(torch.allclose(outputs.pooler_output[:, :3], expected_slice, atol=TOLERANCE))
@unittest.skip(reason="Model not available yet")
def test_inference_masked_lm(self):
pass
# TapasForQuestionAnswering has 3 possible ways of being fine-tuned:
# - conversational set-up (SQA)
# - weak supervision for aggregation (WTQ, WikiSQL)
# - strong supervision for aggregation (WikiSQL-supervised)
# We test all of them:
@slow
def test_inference_question_answering_head_conversational(self):
# note that google/tapas-base-finetuned-sqa should correspond to tapas_sqa_inter_masklm_base_reset
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-sqa").to(torch_device)
tokenizer = self.default_tokenizer
table, queries = prepare_tapas_single_inputs_for_inference()
inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
inputs = {k: v.to(torch_device) for k, v in inputs.items()}
outputs = model(**inputs)
# test the logits
logits = outputs.logits
expected_shape = torch.Size((1, 21))
self.assertEqual(logits.shape, expected_shape)
expected_tensor = torch.tensor(
[
[
-9997.22461,
-9997.22461,
-9997.22461,
-9997.22461,
-9997.22461,
-9997.22461,
-9997.22461,
-9997.22461,
-9997.22461,
-16.2628059,
-10004.082,
15.4330549,
15.4330549,
15.4330549,
-9990.42,
-16.3270779,
-16.3270779,
-16.3270779,
-16.3270779,
-16.3270779,
-10004.8506,
]
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits, expected_tensor, atol=TOLERANCE))
@slow
def test_inference_question_answering_head_conversational_absolute_embeddings(self):
# note that google/tapas-small-finetuned-sqa should correspond to tapas_sqa_inter_masklm_small_reset
# however here we test the version with absolute position embeddings
model = TapasForQuestionAnswering.from_pretrained("google/tapas-small-finetuned-sqa", revision="no_reset").to(
torch_device
)
tokenizer = self.default_tokenizer
table, queries = prepare_tapas_single_inputs_for_inference()
inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
inputs = {k: v.to(torch_device) for k, v in inputs.items()}
outputs = model(**inputs)
# test the logits
logits = outputs.logits
expected_shape = torch.Size((1, 21))
self.assertEqual(logits.shape, expected_shape)
expected_tensor = torch.tensor(
[
[
-10014.7793,
-10014.7793,
-10014.7793,
-10014.7793,
-10014.7793,
-10014.7793,
-10014.7793,
-10014.7793,
-10014.7793,
-18.8419304,
-10018.0391,
17.7848816,
17.7848816,
17.7848816,
-9981.02832,
-16.4005489,
-16.4005489,
-16.4005489,
-16.4005489,
-16.4005489,
-10013.4736,
]
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits, expected_tensor, atol=TOLERANCE))
@slow
def test_inference_question_answering_head_weak_supervision(self):
# note that google/tapas-base-finetuned-wtq should correspond to tapas_wtq_wikisql_sqa_inter_masklm_base_reset
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq").to(torch_device)
tokenizer = self.default_tokenizer
# let's test on a batch
table, queries = prepare_tapas_batch_inputs_for_inference()
inputs = tokenizer(table=table, queries=queries, padding="longest", return_tensors="pt")
inputs_on_device = {k: v.to(torch_device) for k, v in inputs.items()}
outputs = model(**inputs_on_device)
# test the logits
logits = outputs.logits
expected_shape = torch.Size((2, 28))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor(
[
[-160.375504, -160.375504, -160.375504, -10072.3965, -10070.9414, -10094.9736],
[-9861.6123, -9861.6123, -9861.6123, -9861.6123, -9891.01172, 146.600677],
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits[:, -6:], expected_slice, atol=TOLERANCE))
# test the aggregation logits
logits_aggregation = outputs.logits_aggregation
expected_shape = torch.Size((2, 4))
self.assertEqual(logits_aggregation.shape, expected_shape)
expected_tensor = torch.tensor(
[[18.8545208, -9.76614857, -6.3128891, -2.93525243], [-4.05782509, 40.0351, -5.35329962, 23.3978653]],
device=torch_device,
)
self.assertTrue(torch.allclose(logits_aggregation, expected_tensor, atol=TOLERANCE))
# test the predicted answer coordinates and aggregation indices
EXPECTED_PREDICTED_ANSWER_COORDINATES = [[(0, 0)], [(1, 2)]]
EXPECTED_PREDICTED_AGGREGATION_INDICES = [0, 1]
predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
inputs, outputs.logits.detach().cpu(), outputs.logits_aggregation.detach().cpu()
)
self.assertEqual(EXPECTED_PREDICTED_ANSWER_COORDINATES, predicted_answer_coordinates)
self.assertEqual(EXPECTED_PREDICTED_AGGREGATION_INDICES, predicted_aggregation_indices)
@slow
def test_training_question_answering_head_weak_supervision(self):
# note that google/tapas-base-finetuned-wtq should correspond to tapas_wtq_wikisql_sqa_inter_masklm_base_reset
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq").to(torch_device)
model.to(torch_device)
# normally we should put the model in training mode but it's a pain to do this with the TF 1 implementation
tokenizer = self.default_tokenizer
# let's test on a batch
table, queries, answer_coordinates, answer_text, float_answer = prepare_tapas_batch_inputs_for_training()
inputs = tokenizer(
table=table,
queries=queries,
answer_coordinates=answer_coordinates,
answer_text=answer_text,
padding="longest",
return_tensors="pt",
)
# prepare data (created by the tokenizer) and move to torch_device
input_ids = inputs["input_ids"].to(torch_device)
attention_mask = inputs["attention_mask"].to(torch_device)
token_type_ids = inputs["token_type_ids"].to(torch_device)
labels = inputs["labels"].to(torch_device)
numeric_values = inputs["numeric_values"].to(torch_device)
numeric_values_scale = inputs["numeric_values_scale"].to(torch_device)
# the answer should be prepared by the user
float_answer = torch.FloatTensor(float_answer).to(torch_device)
# forward pass to get loss + logits:
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
labels=labels,
numeric_values=numeric_values,
numeric_values_scale=numeric_values_scale,
float_answer=float_answer,
)
# test the loss
loss = outputs.loss
expected_loss = torch.tensor(3.3527612686157227e-08, device=torch_device)
self.assertTrue(torch.allclose(loss, expected_loss, atol=TOLERANCE))
# test the logits on the first example
logits = outputs.logits
expected_shape = torch.Size((2, 29))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor(
[
-160.0156,
-160.0156,
-160.0156,
-160.0156,
-160.0156,
-10072.2266,
-10070.8896,
-10092.6006,
-10092.6006,
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits[0, -9:], expected_slice, atol=TOLERANCE))
# test the aggregation logits on the second example
logits_aggregation = outputs.logits_aggregation
expected_shape = torch.Size((2, 4))
self.assertEqual(logits_aggregation.shape, expected_shape)
expected_slice = torch.tensor([-4.0538, 40.0304, -5.3554, 23.3965], device=torch_device)
self.assertTrue(torch.allclose(logits_aggregation[1, -4:], expected_slice, atol=TOLERANCE))
@slow
def test_inference_question_answering_head_strong_supervision(self):
# note that google/tapas-base-finetuned-wikisql-supervised should correspond to tapas_wikisql_sqa_inter_masklm_base_reset
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wikisql-supervised").to(
torch_device
)
tokenizer = self.default_tokenizer
table, queries = prepare_tapas_single_inputs_for_inference()
inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
inputs = {k: v.to(torch_device) for k, v in inputs.items()}
outputs = model(**inputs)
# test the logits
logits = outputs.logits
expected_shape = torch.Size((1, 21))
self.assertEqual(logits.shape, expected_shape)
expected_tensor = torch.tensor(
[
[
-10011.1084,
-10011.1084,
-10011.1084,
-10011.1084,
-10011.1084,
-10011.1084,
-10011.1084,
-10011.1084,
-10011.1084,
-18.6185989,
-10008.7969,
17.6355762,
17.6355762,
17.6355762,
-10002.4404,
-18.7111301,
-18.7111301,
-18.7111301,
-18.7111301,
-18.7111301,
-10007.0977,
]
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits, expected_tensor, atol=TOLERANCE))
# test the aggregation logits
logits_aggregation = outputs.logits_aggregation
expected_shape = torch.Size((1, 4))
self.assertEqual(logits_aggregation.shape, expected_shape)
expected_tensor = torch.tensor(
[[16.5659733, -3.06624889, -2.34152961, -0.970244825]], device=torch_device
) # PyTorch model outputs [[16.5679, -3.0668, -2.3442, -0.9674]]
self.assertTrue(torch.allclose(logits_aggregation, expected_tensor, atol=TOLERANCE))
@slow
def test_inference_classification_head(self):
# note that google/tapas-base-finetuned-tabfact should correspond to tapas_tabfact_inter_masklm_base_reset
model = TapasForSequenceClassification.from_pretrained("google/tapas-base-finetuned-tabfact").to(torch_device)
tokenizer = self.default_tokenizer
table, queries = prepare_tapas_single_inputs_for_inference()
inputs = tokenizer(table=table, queries=queries, padding="longest", return_tensors="pt")
inputs = {k: v.to(torch_device) for k, v in inputs.items()}
outputs = model(**inputs)
# test the classification logits
logits = outputs.logits
expected_shape = torch.Size((1, 2))
self.assertEqual(logits.shape, expected_shape)
expected_tensor = torch.tensor(
[[0.795137286, 9.5572]], device=torch_device
) # Note that the PyTorch model outputs [[0.8057, 9.5281]]
self.assertTrue(torch.allclose(outputs.logits, expected_tensor, atol=TOLERANCE))
# Below: tests for Tapas utilities which are defined in modeling_tapas.py.
# These are based on segmented_tensor_test.py of the original implementation.
# URL: https://github.com/google-research/tapas/blob/master/tapas/models/segmented_tensor_test.py
@require_scatter
class TapasUtilitiesTest(unittest.TestCase):
def _prepare_tables(self):
"""Prepares two tables, both with three distinct rows.
The first table has two columns:
1.0, 2.0 | 3.0
2.0, 0.0 | 1.0
1.0, 3.0 | 4.0
The second table has three columns:
1.0 | 2.0 | 3.0
2.0 | 0.0 | 1.0
1.0 | 3.0 | 4.0
Returns:
SegmentedTensors with the tables.
"""
values = torch.tensor(
[
[[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]],
[[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]],
]
)
row_index = IndexMap(
indices=torch.tensor(
[
[[0, 0, 0], [1, 1, 1], [2, 2, 2]],
[[0, 0, 0], [1, 1, 1], [2, 2, 2]],
]
),
num_segments=3,
batch_dims=1,
)
col_index = IndexMap(
indices=torch.tensor(
[
[[0, 0, 1], [0, 0, 1], [0, 0, 1]],
[[0, 1, 2], [0, 1, 2], [0, 1, 2]],
]
),
num_segments=3,
batch_dims=1,
)
return values, row_index, col_index
def test_product_index(self):
_, row_index, col_index = self._prepare_tables()
cell_index = ProductIndexMap(row_index, col_index)
row_index_proj = cell_index.project_outer(cell_index)
col_index_proj = cell_index.project_inner(cell_index)
ind = cell_index.indices
self.assertEqual(cell_index.num_segments, 9)
# Projections should give back the original indices.
# we use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(row_index.indices.numpy(), row_index_proj.indices.numpy())
self.assertEqual(row_index.num_segments, row_index_proj.num_segments)
self.assertEqual(row_index.batch_dims, row_index_proj.batch_dims)
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(col_index.indices.numpy(), col_index_proj.indices.numpy())
self.assertEqual(col_index.batch_dims, col_index_proj.batch_dims)
# The first and second "column" are identified in the first table.
for i in range(3):
self.assertEqual(ind[0, i, 0], ind[0, i, 1])
self.assertNotEqual(ind[0, i, 0], ind[0, i, 2])
# All rows are distinct in the first table.
for i, i_2 in zip(range(3), range(3)):
for j, j_2 in zip(range(3), range(3)):
if i != i_2 and j != j_2:
self.assertNotEqual(ind[0, i, j], ind[0, i_2, j_2])
# All cells are distinct in the second table.
for i, i_2 in zip(range(3), range(3)):
for j, j_2 in zip(range(3), range(3)):
if i != i_2 or j != j_2:
self.assertNotEqual(ind[1, i, j], ind[1, i_2, j_2])
def test_flatten(self):
_, row_index, col_index = self._prepare_tables()
row_index_flat = flatten(row_index)
col_index_flat = flatten(col_index)
shape = [3, 4, 5]
batched_index = IndexMap(indices=torch.zeros(shape).type(torch.LongTensor), num_segments=1, batch_dims=3)
batched_index_flat = flatten(batched_index)
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(
row_index_flat.indices.numpy(), [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
)
np.testing.assert_array_equal(
col_index_flat.indices.numpy(), [0, 0, 1, 0, 0, 1, 0, 0, 1, 3, 4, 5, 3, 4, 5, 3, 4, 5]
)
self.assertEqual(batched_index_flat.num_segments.numpy(), np.prod(shape))
np.testing.assert_array_equal(batched_index_flat.indices.numpy(), range(np.prod(shape)))
def test_range_index_map(self):
batch_shape = [3, 4]
num_segments = 5
index = range_index_map(batch_shape, num_segments)
self.assertEqual(num_segments, index.num_segments)
self.assertEqual(2, index.batch_dims)
indices = index.indices
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(list(indices.size()), [3, 4, 5])
for i in range(batch_shape[0]):
for j in range(batch_shape[1]):
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(indices[i, j, :].numpy(), range(num_segments))
def test_reduce_sum(self):
values, row_index, col_index = self._prepare_tables()
cell_index = ProductIndexMap(row_index, col_index)
row_sum, _ = reduce_sum(values, row_index)
col_sum, _ = reduce_sum(values, col_index)
cell_sum, _ = reduce_sum(values, cell_index)
# We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
np.testing.assert_allclose(row_sum.numpy(), [[6.0, 3.0, 8.0], [6.0, 3.0, 8.0]])
np.testing.assert_allclose(col_sum.numpy(), [[9.0, 8.0, 0.0], [4.0, 5.0, 8.0]])
np.testing.assert_allclose(
cell_sum.numpy(),
[[3.0, 3.0, 0.0, 2.0, 1.0, 0.0, 4.0, 4.0, 0.0], [1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 1.0, 3.0, 4.0]],
)
def test_reduce_mean(self):
values, row_index, col_index = self._prepare_tables()
cell_index = ProductIndexMap(row_index, col_index)
row_mean, _ = reduce_mean(values, row_index)
col_mean, _ = reduce_mean(values, col_index)
cell_mean, _ = reduce_mean(values, cell_index)
# We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
np.testing.assert_allclose(
row_mean.numpy(), [[6.0 / 3.0, 3.0 / 3.0, 8.0 / 3.0], [6.0 / 3.0, 3.0 / 3.0, 8.0 / 3.0]]
)
np.testing.assert_allclose(col_mean.numpy(), [[9.0 / 6.0, 8.0 / 3.0, 0.0], [4.0 / 3.0, 5.0 / 3.0, 8.0 / 3.0]])
np.testing.assert_allclose(
cell_mean.numpy(),
[
[3.0 / 2.0, 3.0, 0.0, 2.0 / 2.0, 1.0, 0.0, 4.0 / 2.0, 4.0, 0.0],
[1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 1.0, 3.0, 4.0],
],
)
def test_reduce_max(self):
values = torch.as_tensor([2.0, 1.0, 0.0, 3.0])
index = IndexMap(indices=torch.as_tensor([0, 1, 0, 1]), num_segments=2)
maximum, _ = reduce_max(values, index)
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(maximum.numpy(), [2, 3])
def test_reduce_sum_vectorized(self):
values = torch.as_tensor([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
index = IndexMap(indices=torch.as_tensor([0, 0, 1]), num_segments=2, batch_dims=0)
sums, new_index = reduce_sum(values, index)
# We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
np.testing.assert_allclose(sums.numpy(), [[3.0, 5.0, 7.0], [3.0, 4.0, 5.0]])
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(new_index.indices.numpy(), [0, 1])
np.testing.assert_array_equal(new_index.num_segments.numpy(), 2)
np.testing.assert_array_equal(new_index.batch_dims, 0)
def test_gather(self):
values, row_index, col_index = self._prepare_tables()
cell_index = ProductIndexMap(row_index, col_index)
# Compute sums and then gather. The result should have the same shape as
# the original table and each element should contain the sum the values in
# its cell.
sums, _ = reduce_sum(values, cell_index)
cell_sum = gather(sums, cell_index)
assert cell_sum.size() == values.size()
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_allclose(
cell_sum.numpy(),
[[[3.0, 3.0, 3.0], [2.0, 2.0, 1.0], [4.0, 4.0, 4.0]], [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]]],
)
def test_gather_vectorized(self):
values = torch.as_tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
index = IndexMap(indices=torch.as_tensor([[0, 1], [1, 0]]), num_segments=2, batch_dims=1)
result = gather(values, index)
# We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
np.testing.assert_array_equal(result.numpy(), [[[1, 2], [3, 4]], [[7, 8], [5, 6]]])
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment