Unverified Commit 1551e2dc authored by NielsRogge, committed by GitHub

[WIP] Tapas v4 (tres) (#9117)



* First commit: adding all files from tapas_v3

* Fix multiple bugs including soft dependency and new structure of the library

* Improve testing by adding torch_device to inputs and adding dependency on scatter

* Use Python 3 inheritance rather than Python 2

* First draft model cards of base sized models

* Remove model cards as they are already on the hub

* Fix multiple bugs with integration tests

* All model integration tests pass

* Remove print statement

* Add test for the convert_logits_to_predictions method of TapasTokenizer (see the usage sketch after the commit message)

* Incorporate suggestions by Google authors

* Fix remaining tests

* Change position embeddings sizes to 512 instead of 1024

* Comment out positional embedding sizes

* Update PRETRAINED_VOCAB_FILES_MAP and PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

* Add more model names

* Fix truncation when no max length is specified

* Disable torchscript test

* Make style & make quality

* Quality

* Address CI needs

* Test the Masked LM model

* Fix the masked LM model

* Truncate when overflowing

* More much-needed docs improvements

* Fix some URLs

* Some more docs improvements

* Test PyTorch scatter

* Set to slow + minify

* Calm flake8 down

* Add add_pooling_layer argument to TapasModel (see the sketch after the diff)

Address review comments from @sgugger and @patrickvonplaten

* Fix issue in docs + fix style and quality

* Clean up conversion script and add task parameter to TapasConfig

* Revert the task parameter of TapasConfig

Some minor fixes

* Improve conversion script and add test for absolute position embeddings

* Fix bug with reset_position_index_per_cell arg of the conversion cli

* Add notebooks to the examples directory and fix style and quality

* Apply suggestions from code review

* Move from `nielsr/` to `google/` namespace

* Apply Sylvain's comments
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Rogge Niels <niels.rogge@howest.be>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent ad895af9
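
For readers of this changelog, here is a minimal usage sketch (not part of the commit) of the TapasTokenizer prediction API referenced above. The google/tapas-base-finetuned-wtq checkpoint and the toy table are assumptions chosen purely for illustration; as the commit messages note, TAPAS also depends on the torch-scatter package.

# Illustrative sketch (not part of this commit): query a table with TAPAS and turn
# token-level logits into cell predictions via convert_logits_to_predictions.
# Assumptions: the "google/tapas-base-finetuned-wtq" checkpoint and the toy table
# below are for illustration only; torch-scatter must be installed.
import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq")

# TAPAS expects every table cell to be a string.
table = pd.DataFrame({"Actor": ["Brad Pitt", "Leonardo DiCaprio"], "Age": ["59", "48"]})
queries = ["How old is Brad Pitt?"]

inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
outputs = model(**inputs)

# Map logits back to (row, column) table coordinates and, for WTQ-style models,
# an aggregation operator index per query.
predicted_coordinates, predicted_aggregation = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
)
print(predicted_coordinates, predicted_aggregation)
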
@@ -584,7 +584,7 @@ class TokenizerTesterMixin:
         # We want to have sequence 0 and sequence 1 are tagged
         # respectively with 0 and 1 token_ids
-        # (regardeless of weither the model use token type ids)
+        # (regardless of whether the model use token type ids)
         # We use this assumption in the QA pipeline among other place
         output = tokenizer(seq_0, return_token_type_ids=True)
         self.assertIn(0, output["token_type_ids"])
@@ -600,7 +600,7 @@ class TokenizerTesterMixin:
         # We want to have sequence 0 and sequence 1 are tagged
         # respectively with 0 and 1 token_ids
-        # (regardeless of weither the model use token type ids)
+        # (regardless of whether the model use token type ids)
         # We use this assumption in the QA pipeline among other place
         output = tokenizer(seq_0)
         self.assertIn(0, output.sequence_ids())
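
On the modeling side, a brief hedged sketch (again not part of the commit) of the new add_pooling_layer argument mentioned in the commit message: like other BERT-style models in the library, passing add_pooling_layer=False is assumed to skip the pooler so that only last_hidden_state is produced.

# Hedged sketch of the add_pooling_layer flag added to TapasModel in this PR.
# Assumption: as with BertModel, disabling the pooler leaves pooler_output as None.
from transformers import TapasConfig, TapasModel

config = TapasConfig()  # randomly initialized model, for illustration only
encoder = TapasModel(config, add_pooling_layer=False)
# encoder(...) now returns last_hidden_state without a pooled [CLS] representation.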