improve saving strategy of sentencepiece tokenizer (#15328)

* add new test * add a feature to same the sentencepiece tokenizer model when the init file was deleted * update marian * update m2m_100 * fix marian * update speech to text * override test for layoutxlm * fix saving bartpho * remove harcoded values bartpho * special token string version * finish bartpho * override layoutxml test * add mbart * move special tokens list * format * Revert "format" This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7. * simplify list of string of special tokens * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

improve saving strategy of sentencepiece tokenizer (#15328)
* add new test * add a feature to same the sentencepiece tokenizer model when the init file was deleted * update marian * update m2m_100 * fix marian * update speech to text * override test for layoutxlm * fix saving bartpho * remove harcoded values bartpho * special token string version * finish bartpho * override layoutxml test * add mbart * move special tokens list * format * Revert "format" This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7. * simplify list of string of special tokens * Re-write `self.fairseq_tokens_to_ids ` initialization logic with special tokens Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
ade7371a · SaulLu · GitHub · 196cce6e · ade7371a
Unverified Commit ade7371a authored Jan 27, 2022 by SaulLu Committed by GitHub Jan 27, 2022
Hide whitespace changes
Inline Side-by-side

Showing with 1 addition and 0 deletions

tests/test_tokenization_mbart.py tests/test_tokenization_mbart.py +1 -0

No files found.
--- a/tests/test_tokenization_mbart.py
+++ b/tests/test_tokenization_mbart.py
@@ -39,6 +39,7 @@ class MBartTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = MBartTokenizer
    rust_tokenizer_class = MBartTokenizerFast
    test_rust_tokenizer = True
+    test_sentencepiece = True

    def setUp(self):
        super().setUp()