  1. 17 Aug, 2023 1 commit
  2. 27 Jul, 2023 1 commit
  3. 11 Jul, 2023 1 commit
  4. 30 Jun, 2023 1 commit
  5. 06 Feb, 2023 1 commit
    • Update quality tooling for formatting (#21480) · 6f79d264
      Sylvain Gugger authored
      * Result of black 23.1
      
      * Update target to Python 3.7
      
      * Switch flake8 to ruff
      
      * Configure isort
      
      * Configure isort
      
      * Apply isort with line limit
      
      * Put the right black version
      
      * adapt black in check copies
      
      * Fix copies
  6. 18 Jan, 2023 1 commit
  7. 23 Nov, 2022 1 commit
    • change the way sentinel tokens can be retrieved (#20373) · 03ae1f06
      raghavanone authored
      * change the way sentinel tokens can be retrieved
      
      * Fix line length for doc string
      
      * Fix line length for doc string
      
      * Add a stronger test for t5 tokenization
      
      * Format file changes
      
      * Make a stronger test for filtering sentinel tokens
      
      * fix file format issues
  8. 02 Nov, 2022 1 commit
    • 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in... · 9f9ddcc2
      Ben Eyal authored
      🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775)
      
      * Add test for SentencePiece not adding special tokens to strings
      
      * Add SentencePieceStringConversionMixin to fix issue 15003
      
      * Fix conversion from tokens to string for most SentencePiece tokenizers
      
      Tokenizers fixed:
      - AlbertTokenizer
      - BarthezTokenizer
      - CamembertTokenizer
      - FNetTokenizer
      - M2M100Tokenizer
      - MBart50Tokenizer
      - PegasusTokenizer
      - Speech2TextTokenizer
      
      * Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
      
      * Fix DebertaV2Tokenizer
      
      * Ignore LayoutXLMTokenizer in SentencePiece string conversion test
      
      * Run 'make style' and 'make quality'
      
      * Clean convert_tokens_to_string test
      
      Instead of explicitly ignoring LayoutXLMTokenizer in the test,
      override the test in LayoutLMTokenizationTest and do nothing in it.
      
      * Remove commented out code
      
      * Improve robustness of convert_tokens_to_string test
      
      Instead of comparing lengths of re-tokenized text and input_ids,
      check that converting all special tokens to string yields a string
      with all special tokens.
      
      * Inline and remove SentencePieceStringConversionMixin
      
      The convert_tokens_to_string method is now implemented
      in each relevant SentencePiece tokenizer.
      
      * Run 'make style' and 'make quality'
      
      * Revert removal of space in convert_tokens_to_string
      
      * Remove redundant import
      
      * Revert test text to original
      
      * Uncomment the lowercasing of the reverse_text variable
      
      * Mimic Rust tokenizer behavior for tokenizers
      
      - Albert
      - Barthez
      - Camembert
      - MBart50
      - T5
      
      * Fix accidentally skipping test in wrong tokenizer
      
      * Add test for equivalent Rust and slow tokenizer behavior
      
      * Override _decode in BigBirdTokenizer to mimic Rust behavior
      
      * Override _decode in FNetTokenizer to mimic Rust behavior
      
      * Override _decode in XLNetTokenizer to mimic Rust behavior
      
      * Remove unused 're' import
      
      * Update DebertaV2Tokenizer to mimic Rust tokenizer
      
      * Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
      
      * Ignore problematic tests in Deberta V2
      
      * Add comment on why the Deberta V2 tests are skipped
  9. 12 May, 2022 1 commit
  10. 02 May, 2022 1 commit
  11. 27 Jan, 2022 1 commit
    • improve saving strategy of sentencepiece tokenizer (#15328) · ade7371a
      SaulLu authored
      * add new test
      
      * add a feature to save the sentencepiece tokenizer model when the init file was deleted
      
      * update marian
      
      * update m2m_100
      
      * fix marian
      
      * update speech to text
      
      * override test for layoutxlm
      
      * fix saving bartpho
      
      * remove hardcoded values bartpho
      
      * special token string version
      
      * finish bartpho
      
      * override layoutxlm test
      
      * add mbart
      
      * move special tokens list
      
      * format
      
      * Revert "format"
      
      This reverts commit 37a40df37903a932c2f951cbd33acb684246bae7.
      
      * simplify list of string of special tokens
      
      * Re-write `self.fairseq_tokens_to_ids` initialization logic with special tokens
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
  12. 27 Dec, 2021 1 commit
    • Doc styler v2 (#14950) · 87e6e4fe
      Sylvain Gugger authored
      * New doc styler
      
      * Fix issue with args at the start
      
      * Code sample fixes
      
      * Style code examples in MDX
      
      * Fix more patterns
      
      * Typo
      
      * Typo
      
      * More patterns
      
      * Do without black for now
      
      * Get more info in error
      
      * Docstring style
      
      * Re-enable check
      
      * Quality
      
      * Fix add_end_docstring decorator
      
      * Fix docstring
  13. 21 Dec, 2021 1 commit
    • Mass conversion of documentation from rst to Markdown (#14866) · 27b3031d
      Sylvain Gugger authored
      * Convert docstrings of all configurations and tokenizers
      
      * Processors and fixes
      
      * Last modeling files and fixes to models
      
      * Pipeline modules
      
      * Utils files
      
      * Data submodule
      
      * All the other files
      
      * Style
      
      * Missing examples
      
      * Style again
      
      * Fix copies
      
      * Say bye bye to rst docstrings forever
  14. 13 May, 2021 1 commit
    • Enable option for subword regularization in more tokenizers. (#11417) · 37ed3ab7
      Philip May authored
      * improve slow class tok usage at xlm rob
      
      * add subword regularization for barthez
      
      * improve barthez tok. test
      
      * fix tokenizer tests
      
      * add subword regularization for camembert
      
      * add subword regularization for deberta v2 tokenizer
      
      * add more doc to deberta v2 tokenizer
      
      * add subword regularization for speech to text tok.
      
      * fix sp_model_kwargs type in speech 2 text tok.
      
      * add subword regularization for M2M100 tok.
      
      * add more concrete type hints
      
      * fix tests for m2m100 and s2t tok.
      
      * add missing Any import
      
      * fix syntax error in m2m100 tok.
      
      * fix unpickle of m2m100 and s2t tok.
      
      * fix test of m2m100 and s2t tok.
      
      * improve unpickle of deberta v2 tok.
      
      * add test for pickle of barthez & camembert
      
      * fix pickle of barthez & camembert
      
      * add test for deberta v2 tok. pickle
      
      * fix m2m100 tok. pickle
      
      * fix s2t tok. pickle
      
      * add subword regularization to albert tok.
      
      * refactor subword reg. test into TokenizerTesterMixin
      
      improve albert tok. test
      
      remove sample argument from albert tok.
      
      check subword reg. using TokenizerTesterMixin
      
      improve tok. tests
      
      improve xlm roberta tok. tests
      
      improve xlm roberta tok. tests
      
      * add subword regularization for big bird t.
      
      * improve xlm roberta tok. test
      
      * add subword regularization for mbart50 tok.
      
      * add subword regularization for pegasus tok.
      
      * add subword regularization for reformer tok.
      
      * add subword regularization for T5 tok.
      
      * fix t5 tok. test formatting
      
      * add subword regularization for xlm_proph. tok.
      
      * add subword regularization for xlnet tok.
      
      * add subword regularization for bert_gen tok.
      
      * add typing to tokenizers
      
      * add typing to xlm rob. tok
      
      * add subword regularization for marian tok.
      
      * add reverse tok. test
      
      * fix marian tok test
      
      * fix marian tok test
      
      * fix casing in tok. tests
      
      * fix style of tok. common test
      
      * fix deberta v2 tok test
      
      * add type annotations to tok. tests
      
      * add type annotations to tok. __init__
      
      * add typing to tokenizer
      
      * add type annotations to tok. __init__
      
      * don't specify the default when it's None
      
      * fix barthez tok. doc
      
      * move sentencepiece tok. tests to TokenizerTesterMixin
      
      * fix unused imports
      
      * fix albert tok. test
      
      * add comment to sentencepiece test options
      
      * fix Any import at big bird tok.
      
      * fix Any import at xlm prophetnet tok.
      
      * empty commit to trigger CI
  15. 04 May, 2021 1 commit
  16. 26 Apr, 2021 1 commit
  17. 09 Apr, 2021 1 commit
  18. 31 Mar, 2021 1 commit
  19. 10 Mar, 2021 1 commit
  20. 13 Feb, 2021 1 commit
  21. 04 Feb, 2021 1 commit
  22. 02 Feb, 2021 1 commit
  23. 12 Jan, 2021 1 commit
    • Refactor `prepare_seq2seq_batch` (#9524) · 063d8d27
      Sylvain Gugger authored
      * Add target contextmanager and rework prepare_seq2seq_batch
      
      * Fix tests, treat BART and Barthez
      
      * Add last tokenizers
      
      * Fix test
      
      * Set src token before calling the superclass
      
      * Remove special behavior for T5
      
      * Remove needless imports
      
      * Remove needless asserts
  24. 19 Nov, 2020 1 commit
  25. 17 Nov, 2020 2 commits
    • Tokenizers: ability to load from model subfolder (#8586) · 042a6aa7
      Julien Chaumond authored
      * <small>tiny typo</small>
      
      * Tokenizers: ability to load from model subfolder
      
      * use subfolder for local files as well
      
      * Uniformize model shortcut name => model id
      
      * from s3 => from huggingface.co
      Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
    • Reorganize repo (#8580) · c89bdfbe
      Sylvain Gugger authored
      * Put models in subfolders
      
      * Styling
      
      * Fix imports in tests
      
      * More fixes in test imports
      
      * Sneaky hidden imports
      
      * Fix imports in doc files
      
      * More sneaky imports
      
      * Finish fixing tests
      
      * Fix examples
      
      * Fix path for copies
      
      * More fixes for examples
      
      * Fix dummy files
      
      * More fixes for example
      
      * More model import fixes
      
      * Is this why you're unhappy GitHub?
      
      * Fix imports in convert command
  26. 10 Nov, 2020 3 commits
  27. 29 Oct, 2020 1 commit
  28. 26 Oct, 2020 2 commits
    • Doc styling (#8067) · 08f534d2
      Sylvain Gugger authored
      * Important files
      
      * Styling them all
      
      * Revert "Styling them all"
      
      This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
      
      * Styling them for realsies
      
      * Fix syntax error
      
      * Fix benchmark_utils
      
      * More fixes
      
      * Fix modeling auto and script
      
      * Remove new line
      
      * Fixes
      
      * More fixes
      
      * Fix more files
      
      * Style
      
      * Add FSMT
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * More fixes
      
      * Fixes
      
      * More fixes
      
      * More fixes
      
      * Last fixes
      
      * Make sphinx happy
    • [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization (#8006) · 79eb3915
      Thomas Wolf authored
      * fixing #8001
      
      * make T5 tokenizer serialization more robust - style
  29. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉

      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up tests lighten up when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style quality split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
  30. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove... · 9aeacb58
      Thomas Wolf authored
      Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4 switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
  31. 23 Sep, 2020 1 commit
  32. 15 Sep, 2020 1 commit
  33. 11 Sep, 2020 1 commit
  34. 10 Sep, 2020 1 commit
    • Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. (#6594) · 7fd1febf
      Patrick von Platen authored
      * add conversion script
      
      * improve conversion script
      
      * make style
      
      * add tryout files
      
      * fix
      
      * update
      
      * add causal bert
      
      * better names
      
      * add tokenizer file as well
      
      * finish causal_bert
      
      * fix small bugs
      
      * improve generate
      
      * change naming
      
      * renaming
      
      * renaming
      
      * renaming
      
      * remove leftover files
      
      * clean files
      
      * add fix tokenizer
      
      * finalize
      
      * correct slow test
      
      * update docs
      
      * small fixes
      
      * fix link
      
      * adapt check repo
      
      * apply Sam's and Sylvain's recommendations
      
      * fix import
      
      * implement Lysandre's recommendations
      
      * fix logger warn
  35. 04 Sep, 2020 1 commit
  36. 28 Aug, 2020 1 commit