1. 26 Jun, 2024 1 commit
  2. 13 Mar, 2024 1 commit
  3. 18 Sep, 2023 1 commit
    • 🚨🚨 🚨🚨 [`Tokenizer`] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909) · 2da88537
      Arthur authored
      
      
      * fix test for bart. Order is correct now let's skip BPEs
      
      * phew
      
      * styling
      
      * fix bert....
      
      * slow refactoring
      
      * current updates
      
      * massive refactoring
      
      * update
      
      * NICE!
      
      * update to see where I am at
      
      * updates
      
      * update
      
      * update
      
      * revert
      
      * updates
      
      * updates
      
      * start supporting legacy_save
      
      * styling
      
      * big update
      
      * revert some changes
      
      * nits
      
      * nniiiiiice
      
      * small fixes
      
      * kinda fix t5 with new behaviour
      
      * major update
      
      * fixup
      
      * fix copies
      
      * today's updates
      
      * fix byt5
      
      * update
      
      * update
      
      * update
      
      * updates
      
      * update vocab size test
      
      * Barthez does not need the fairseq offset ids
      
      * super call must be after
      
      * call super
      
      * move all super init
      
      * move other super init
      
      * fixup
      
      * nits
      
      * more fixes
      
      * nits
      
      * more fixes
      
      * nits
      
      * more fix
      
      * remove useless files
      
      * ouch all of them are affected
      
      * and more!
      
      * small improvements
      
      * no more sanitize token
      
      * more changes around unique no split tokens
      
      * partially fix more things
      
      * keep legacy save but add warning
      
      * so... more fixes
      
      * updates
      
      * guess deberta tokenizer could be nuked
      
      * fixup
      
      * fixup did some bad things
      
      * nuke it if it breaks
      
      * remove prints and pretrain fast from slow with new format.
      
      * fixups
      
      * Apply suggestions from code review
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * phew
      
      * nit
      
      * by default specials should not be normalized?
      
      * update
      
      * remove breakpoint
      
      * updates
      
      * a lot of updates
      
      * fixup
      
      * fixes revert some changes to match fast
      
      * small nits
      
      * that makes it cleaner
      
      * fix camembert accordingly
      
      * update
      
      * some less breaking changes
      
      * update
      
      * fixup
      
      * fix byt5 and whisper mostly
      
      * some more fixes, canine's byte vocab
      
      * fix gpt2
      
      * fix most of the perceiver tests (4 left)
      
      * fix layoutlmv3
      
      * fixup
      
      * fix copies for gpt2 style
      
      * make sure to only warn once
      
      * fix perceiver and gpt2 tests
      
      * some more backward compatibility: also read the special tokens map because some people still use it
      
      * fixup
      
      * add else when reading
      
      * nits
      
      * fresh updates
      
      * fix copies
      
      * will this make everything faster?
      
      * fixes
      
      * more fixes
      
      * update
      
      * more fixes
      
      * fixup
      
      * is the source of truth right?
      
      * sorry camembert for the troubles
      
      * current updates
      
      * fixup
      
      * update led
      
      * update
      
      * fix regression
      
      * fix single word
      
      * more model specific fixes
      
      * fix t5 tests
      
      * fixup
      
      * more comments
      
      * update
      
      * fix nllb
      
      * rstrip removed
      
      * small fixes
      
      * better handle additional_special_tokens and vocab sizes
      
      * fixing
      
      * styling
      
      * fix 4 / 21
      
      * fixup
      
      * fix nllb's tests
      
      * some fixes
      
      * fix t5
      
      * fixes
      
      * style
      
      * fix canine tests
      
      * damn this is nice
      
      * nits
      
      * m2m100 nit
      
      * fixups
      
      * fixes!
      
      * fixup
      
      * stash
      
      * fix merge
      
      * revert bad change
      
      * fixup
      
      * correct order for code Llama
      
      * fix speecht5 post merge
      
      * styling
      
      * revert source of 11 fails
      
      * small nits
      
      * all changes in one go
      
      * fnet hack
      
      * fix 2 more tests
      
      * update based on main branch of tokenizers
      
      * fixup
      
      * fix VITS issues
      
      * more fixes
      
      * fix mgp test
      
      * fix camembert issues
      
      * oops, camembert still has 2 failing tests
      
      * mluke fixes
      
      * decode fixes
      
      * small nits
      
      * nits
      
      * fix llama and vits
      
      * fix camembert
      
      * small nits
      
      * more fixes when initialising a fast tokenizer from a slow one, etc.
      
      * fix one of the last tests
      
      * fix CPM tokenizer test
      
      * fixups
      
      * fix pop2piano
      
      * fixup
      
      * Change tokenizers required version
      
      * Change tokenizers required version
      
      * "tokenizers>=0.14,<0.15", don't forget smaller than
      
      * fix musicgen tests and PreTrainedTokenizerFast
      
      * fix owlvit and all
      
      * update t5
      
      * fix 800 red
      
      * fix tests
      
      * fix the fix of the fix of t5
      
      * styling
      
      * documentation nits
      
      * cache _added_tokens_encoder
      
      * fixups
      
      * Nit
      
      * fix red tests
      
      * one last nit!
      
      * make everything a lot simpler
      
      * Now it's over 😉
      
      
      
      * few small nits
      
      * Apply suggestions from code review
      Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
      
      * updates that work for now
      
      * tests that should not be skipped / changed and fixed next
      
      * fixup
      
      * i am ashamed
      
      * push the fix
      
      * update
      
      * fixups
      
      * nits
      
      * fix added_tokens_encoder
      
      * fix canine test
      
      * fix pegasus vocab
      
      * fix transfoXL
      
      * fixup
      
      * whisper needs to be fixed for train new
      
      * pegasus nits
      
      * more pegasus fixes
      
      * minor update
      
      * better error message in failed test
      
      * fix whisper failing test
      
      * fix whisper failing test
      
      * fix pegasus
      
      * fixup
      
      * fix **** pegasus
      
      * reset things
      
      * remove another file
      
      * attempts to fix the strange custom encoder and offset
      
      * nits here and there
      
      * update
      
      * fixup
      
      * nit
      
      * fix the whisper test
      
      * nits nits
      
      * Apply suggestions from code review
      Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
      
      * updates based on review
      
      * some small update to potentially remove
      
      * nits
      
      * import lru cache
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <hi@lysand.re>
      
      * move warning to `from_pretrained`
      
      * update tests results now that the special tokens are always added
      
      ---------
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
      Co-authored-by: Lysandre Debut <hi@lysand.re>
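
      A minimal sketch of the added-token behaviour this PR reworks, assuming a recent transformers release; the checkpoint, token strings and flag values below are placeholders, and the flag defaults have shifted across versions:

          from transformers import AddedToken, AutoTokenizer

          tok = AutoTokenizer.from_pretrained("t5-small")

          # Plain strings and AddedToken objects can both be registered; AddedToken
          # carries the stripping/normalization flags this PR touches.
          tok.add_tokens(["<custom>", AddedToken("<raw>", lstrip=False, rstrip=False, normalized=False)])

          # Special tokens are stored separately and are never split during tokenization.
          tok.add_special_tokens({"additional_special_tokens": ["<sep2>"]})

          print(len(tok))                        # vocab size now includes the added tokens
          print(tok.tokenize("a <raw> b <sep2> c"))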
  4. 14 Sep, 2023 1 commit
    • Overhaul Conversation class and prompt templating (#25323) · 866df66f
      Matt authored
      
      
      * First commit while I figure this out
      
      * make fixup
      
      * Remove unused method
      
      * Store prompt attrib
      
      * Fix prompt argument for tests
      
      * Make same changes in fast tokenizer
      
      * Remove global prompts from fast tokenizer too
      
      * stash commit
      
      * stash commit
      
      * Migrate PromptConfig to its True Final Location
      
      * Replace Conversation entirely with the new class
      
      * Import/dependency fixes
      
      * Import/dependency fixes
      
      * Change format for lots of default prompts
      
      * More default prompt fixups
      
      * Revert llama old methods so we can compare
      
      * Fix some default configs
      
      * Fix some default configs
      
      * Fix misspelled kwarg
      
      * Fixes for Blenderbot
      
      * make fixup
      
      * little rebase cleanup
      
      * Add basic documentation
      
      * Quick doc fix
      
      * Truncate docstring for now
      
      * Add handling for the case when messages is a single string
      
      * Quick llama merges
      
      * Update conversational pipeline and tests
      
      * Add a couple of legacy properties for backward compatibility
      
      * More legacy handling
      
      * Add docstring for build_conversation_input_ids
      
      * Restructure PromptConfig
      
      * Let's start T E M P L A T I N G
      
      * Refactor all default configs to use templates instead
      
      * Revert changes to the special token properties since we don't need them anymore
      
      * More class templates
      
      * Make the sandbox even sandier
      
      * Everything replaced with pure templating
      
      * Remove docs for PromptConfig
      
      * Add testing and optional requirement boilerplate
      
      * Fix imports and make fixup
      
      * Fix LLaMA tests and add Conversation docstring
      
      * Finally get LLaMA working with the template system
      
      * Finally get LLaMA working with the template system
      
      * make fixup
      
      * make fixup
      
      * fmt-off for the long lists of test tokens
      
      * Rename method to apply_chat_template for now
      
      * Start on documentation
      
      * Make chat_template a property that reads through to the default if it's not set
      
      * Expand docs
      
      * Expand chat templating doc some more
      
      * trim/lstrip blocks by default and update doc
      
      * Few doc tweaks
      
      * rebase cleanup
      
      * Clarify docstring
      
      * rebase cleanup
      
      * rebase cleanup
      
      * make fixup
      
      * Quick doc edit
      
      * Reformat the standard template to match ChatML
      
      * Re-add PEFT check
      
      * Update docs/source/en/chat_templating.md
      Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
      
      * Add apply_chat_template to the tokenizer doc
      
      * make fixup
      
      * Add doc links
      
      * Fix chat links
      
      * Fix chat links
      
      * Explain system messages in the doc
      
      * Add chat template test
      
      * Proper save-loading for chat template attribute
      
      * Add test skips for layout models
      
      * Remove _build_conversation_input_ids, add default_chat_template to code_llama
      
      * Make sure all LLaMA models are using the latest template
      
      * Remove default_system_prompt block in code_llama because it has no default prompt
      
      * Update ConversationPipeline preprocess
      
      * Add correct #Copied from links to the default_chat_templates
      
      * Remove unneeded type checking line
      
      * Add a dummy mark_processed method
      
      * Reorganize Conversation to have **deprecated_kwargs
      
      * Update chat_templating.md
      
      * Quick fix to LLAMA tests
      
      * Small doc tweaks
      
      * Add proper docstrings and "copied from" statements to all default chat templates
      
      * Merge use_default_system_prompt support for code_llama too
      
      * Improve clarity around self.chat_template
      
      * Docstring fix
      
      * Fix blenderbot default template
      
      * More doctest fix
      
      * Break out some tokenizer kwargs
      
      * Update doc to explain default templates
      
      * Quick tweaks to tokenizer args
      
      * Cleanups for tokenizer args
      
      * Add note about caching
      
      * Quick tweak to the chat-templating doc
      
      * Update the LLaMA template with error checking and correct system message embedding
      
      * make fixup
      
      * make fixup
      
      * add requires_jinja
      
      * Cleanup to expected output formatting
      
      * Add caching
      
      * Fix typo in llama default template
      
      * Update LLaMA tests
      
      * Update documentation
      
      * Improved legacy handling in the Conversation class
      
      * Update Jinja template with proper error handling
      
      * Quick bugfix
      
      * Proper exception raising
      
      * Change caching behaviour so it doesn't try to pickle an entire Jinja env
      
      * make fixup
      
      * rebase cleanup
      
      ---------
      Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
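
      A short sketch of the templating API this PR introduces; the checkpoint is a placeholder for any model whose tokenizer ships a chat_template:

          from transformers import AutoTokenizer

          tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # placeholder chat model

          messages = [
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "What is a chat template?"},
          ]

          # chat_template is a Jinja string stored on the tokenizer; apply_chat_template
          # renders the conversation into the prompt format the model was trained on.
          prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
          print(prompt)

          # With tokenize=True (the default) it returns token ids instead of a string.
          input_ids = tok.apply_chat_template(messages, add_generation_prompt=True)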
  5. 07 Feb, 2023 1 commit
  6. 06 Feb, 2023 1 commit
    • Update quality tooling for formatting (#21480) · 6f79d264
      Sylvain Gugger authored
      * Result of black 23.1
      
      * Update target to Python 3.7
      
      * Switch flake8 to ruff
      
      * Configure isort
      
      * Configure isort
      
      * Apply isort with line limit
      
      * Put the right black version
      
      * adapt black in check copies
      
      * Fix copies
  7. 15 Sep, 2022 1 commit
  8. 14 Sep, 2022 1 commit
  9. 12 May, 2022 1 commit
  10. 03 May, 2022 1 commit
    • Move test model folders (#17034) · 19420fd9
      Yih-Dar authored
      
      
      * move test model folders (TODO: fix imports and others)
      
      * fix (potentially partially) imports (in model test modules)
      
      * fix (potentially partially) imports (in tokenization test modules)
      
      * fix (potentially partially) imports (in feature extraction test modules)
      
      * fix import utils.test_modeling_tf_core
      
      * fix path ../fixtures/
      
      * fix imports about generation.test_generation_flax_utils
      
      * fix more imports
      
      * fix fixture path
      
      * fix get_test_dir
      
      * update module_to_test_file
      
      * fix get_tests_dir from wrong transformers.utils
      
      * update config.yml (CircleCI)
      
      * fix style
      
      * remove missing imports
      
      * update new model script
      
      * update check_repo
      
      * update SPECIAL_MODULE_TO_TEST_MAP
      
      * fix style
      
      * add __init__
      
      * update self-scheduled
      
      * fix add_new_model scripts
      
      * check one way to get location back
      
      * python setup.py build install
      
      * fix import in test auto
      
      * update self-scheduled.yml
      
      * update slack notification script
      
      * Add comments about artifact names
      
      * fix for yolos
      Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
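
      The fixture-path fixes above boil down to resolving test data relative to the new tests/ layout; a small sketch, assuming the get_tests_dir helper in transformers.testing_utils and a hypothetical fixture file name:

          from transformers.testing_utils import get_tests_dir

          # Absolute path of the tests/ directory, optionally joined with a sub-path,
          # so fixtures keep resolving after the model test folders moved.
          fixtures_dir = get_tests_dir("fixtures")
          sample_vocab = get_tests_dir("fixtures/test_sentencepiece.model")  # hypothetical fixture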
  11. 23 Feb, 2022 1 commit
  12. 31 Mar, 2021 1 commit
  13. 02 Feb, 2021 1 commit
  14. 12 Jan, 2021 1 commit
    • Refactor `prepare_seq2seq_batch` (#9524) · 063d8d27
      Sylvain Gugger authored
      * Add target contextmanager and rework prepare_seq2seq_batch
      
      * Fix tests, treat BART and Barthez
      
      * Add last tokenizers
      
      * Fix test
      
      * Set src token before calling the superclass
      
      * Remove special behavior for T5
      
      * Remove needless imports
      
      * Remove needless asserts
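
      A sketch of the reworked seq2seq encoding, assuming the "target contextmanager" mentioned above is the as_target_tokenizer helper; the checkpoint and texts are placeholders:

          from transformers import AutoTokenizer

          tok = AutoTokenizer.from_pretrained("facebook/bart-base")

          src_texts = ["Hello world"]
          tgt_texts = ["Bonjour le monde"]

          model_inputs = tok(src_texts, padding=True, truncation=True, return_tensors="pt")

          # Encode the targets with the tokenizer switched into target mode,
          # instead of going through prepare_seq2seq_batch.
          with tok.as_target_tokenizer():
              labels = tok(tgt_texts, padding=True, truncation=True, return_tensors="pt")

          model_inputs["labels"] = labels["input_ids"]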
  15. 07 Dec, 2020 1 commit
  16. 17 Nov, 2020 1 commit
    • Reorganize repo (#8580) · c89bdfbe
      Sylvain Gugger authored
      * Put models in subfolders
      
      * Styling
      
      * Fix imports in tests
      
      * More fixes in test imports
      
      * Sneaky hidden imports
      
      * Fix imports in doc files
      
      * More sneaky imports
      
      * Finish fixing tests
      
      * Fix examples
      
      * Fix path for copies
      
      * More fixes for examples
      
      * Fix dummy files
      
      * More fixes for example
      
      * More model import fixes
      
      * Is this why you're unhappy GitHub?
      
      * Fix imports in convert command
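
      After this reorganization each model lives in its own subfolder under transformers.models, while the public top-level imports keep working; a quick illustration:

          # Top-level imports are unchanged for users...
          from transformers import BertModel, BertTokenizer

          # ...while the source files now sit in per-model subfolders.
          from transformers.models.bert.modeling_bert import BertModel as BertModelDirect
          from transformers.models.bert.tokenization_bert import BertTokenizer as BertTokenizerDirect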
  17. 18 Oct, 2020 1 commit
    • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659) · ba8c4d0a
      Thomas Wolf authored
      * splitting fast and slow tokenizers [WIP]
      
      * [WIP] splitting sentencepiece and tokenizers dependencies
      
      * update dummy objects
      
      * add name_or_path to models and tokenizers
      
      * prefix added to file names
      
      * prefix
      
      * styling + quality
      
      * splitting all the tokenizer files - sorting sentencepiece based ones
      
      * update tokenizer version up to 0.9.0
      
      * remove hard dependency on sentencepiece 🎉
      
      * and removed hard dependency on tokenizers 🎉
      
      
      
      * update conversion script
      
      * update missing models
      
      * fixing tests
      
      * move test_tokenization_fast to main tokenization tests - fix bugs
      
      * bump up tokenizers
      
      * fix bert_generation
      
      * update and fix several tokenizers
      
      * keep sentencepiece in deps for now
      
      * fix funnel and deberta tests
      
      * fix fsmt
      
      * fix marian tests
      
      * fix layoutlm
      
      * fix squeezebert and gpt2
      
      * fix T5 tokenization
      
      * fix xlnet tests
      
      * style
      
      * fix mbart
      
      * bump up tokenizers to 0.9.2
      
      * fix model tests
      
      * fix tf models
      
      * fix seq2seq examples
      
      * fix tests without sentencepiece
      
      * fix slow => fast  conversion without sentencepiece
      
      * update auto and bert generation tests
      
      * fix mbart tests
      
      * fix auto and common test without tokenizers
      
      * fix tests without tokenizers
      
      * clean up and lighten tests when tokenizers + sentencepiece are both off
      
      * style quality and tests fixing
      
      * add sentencepiece to doc/examples reqs
      
      * leave sentencepiece on for now
      
      * style, quality, split herbert and fix pegasus
      
      * WIP Herbert fast
      
      * add sample_text_no_unicode and fix herbert tokenization
      
      * skip FSMT example test for now
      
      * fix style
      
      * fix fsmt in example tests
      
      * update following Lysandre and Sylvain's comments
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/testing_utils.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
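
      A sketch of how the now-optional backends can be probed, assuming the availability helpers in transformers.utils; the install hints are illustrative:

          from transformers.utils import is_sentencepiece_available, is_tokenizers_available

          # Slow SentencePiece-based tokenizers need the sentencepiece package,
          # fast tokenizers need the tokenizers package; neither is a hard dependency.
          if not is_sentencepiece_available():
              print("pip install sentencepiece")
          if not is_tokenizers_available():
              print("pip install tokenizers")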
  18. 08 Oct, 2020 1 commit
    • Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141) · 9aeacb58
      Thomas Wolf authored
      
      
      * [WIP] SP tokenizers
      
      * fixing tests for T5
      
      * WIP tokenizers
      
      * serialization
      
      * update T5
      
      * WIP T5 tokenization
      
      * slow to fast conversion script
      
      * Refactoring to move tokenizer implementations inside transformers
      
      * Adding gpt - refactoring - quality
      
      * WIP adding several tokenizers to the fast world
      
      * WIP Roberta - moving implementations
      
      * update to dev4, switch file loading to in-memory loading
      
      * Updating and fixing
      
      * advancing on the tokenizers - updating do_lower_case
      
      * style and quality
      
      * moving forward with tokenizers conversion and tests
      
      * MBart, T5
      
      * dumping the fast version of transformer XL
      
      * Adding to autotokenizers + style/quality
      
      * update init and space_between_special_tokens
      
      * style and quality
      
      * bump up tokenizers version
      
      * add protobuf
      
      * fix pickle Bert JP with Mecab
      
      * fix newly added tokenizers
      
      * style and quality
      
      * fix bert japanese
      
      * fix funnel
      
      * limit tokenizer warning to one occurrence
      
      * clean up file
      
      * fix new tokenizers
      
      * fast tokenizers deep tests
      
      * WIP adding all the special fast tests on the new fast tokenizers
      
      * quick fix
      
      * adding more fast tokenizers in the fast tests
      
      * all tokenizers in fast version tested
      
      * Adding BertGenerationFast
      
      * bump up setup.py for CI
      
      * remove BertGenerationFast (too early)
      
      * bump up tokenizers version
      
      * Clean old docstrings
      
      * Typo
      
      * Update following Lysandre comments
      Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
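
      A sketch of loading one of the new SentencePiece-backed fast tokenizers and of the slow-to-fast conversion path added here, assuming current module names:

          from transformers import AutoTokenizer, T5Tokenizer
          from transformers.convert_slow_tokenizer import convert_slow_tokenizer

          # use_fast selects the Rust-backed implementation when one exists.
          fast_tok = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

          # The conversion script turns a slow (SentencePiece) tokenizer into the
          # tokenizers.Tokenizer object that backs the fast classes.
          slow_tok = T5Tokenizer.from_pretrained("t5-small")
          backend = convert_slow_tokenizer(slow_tok)
          print(type(backend))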
  19. 15 Jun, 2020 1 commit
    • [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510) · 36434220
      Anthony MOI authored
      
      
      * Use tokenizers pre-tokenized pipeline
      
      * failing pretokenized test
      
      * Fix is_pretokenized in python
      
      * add pretokenized tests
      
      * style and quality
      
      * better tests for batched pretokenized inputs
      
      * tokenizers clean up - new padding_strategy - split the files
      
      * [HUGE] refactoring tokenizers - padding - truncation - tests
      
      * style and quality
      
      * bump up required tokenizers version to 0.8.0-rc1
      
      * switched padding/truncation API - simpler better backward compat
      
      * updating tests for custom tokenizers
      
      * style and quality - tests on pad
      
      * fix QA pipeline
      
      * fix backward compatibility for max_length only
      
      * style and quality
      
      * Various clean-ups - add verbose
      
      * fix tests
      
      * update docstrings
      
      * Fix tests
      
      * Docs reformatted
      
      * __call__ method documented
      Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
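
      A sketch of the API shape after this refactor, written with current keyword names (the pre-tokenized flag was still called is_pretokenized at the time); the model name and lengths are placeholders:

          from transformers import AutoTokenizer

          tok = AutoTokenizer.from_pretrained("bert-base-uncased")

          batch = ["a short sentence", "a much longer sentence that will get truncated " * 8]

          # One __call__ handles batching, the padding strategy and truncation together.
          enc = tok(batch, padding="longest", truncation=True, max_length=32, return_tensors="pt")
          print(enc["input_ids"].shape)

          # Pre-tokenized input goes through the same pipeline.
          enc_words = tok([["Hello", "world"]], is_split_into_words=True, padding=True)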
  20. 06 Apr, 2020 1 commit
    • Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      
      
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-Pick: Partially fix space only input without special tokens added to the output #3091
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittest for tokenizers fast.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forward add_special_tokens to rust.
      
      * Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL test run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines works with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus none input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      
      
      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
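
      A sketch of what BatchEncoding exposes on the fast path introduced here; method names follow current releases:

          from transformers import AutoTokenizer

          tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

          enc = tok("Tokenizers v3 added offsets", return_offsets_mapping=True)

          # BatchEncoding still behaves like the old dict output...
          print(enc["input_ids"])

          # ...and also surfaces the mappings computed by the Rust backend.
          print(enc.tokens())            # token strings
          print(enc["offset_mapping"])   # (start, end) character spans per token
          print(enc.char_to_token(3))    # map a character position to a token index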
  21. 15 Jan, 2020 1 commit
  22. 06 Jan, 2020 2 commits
  23. 05 Jan, 2020 1 commit
  24. 24 Dec, 2019 1 commit
  25. 22 Dec, 2019 8 commits
  26. 21 Dec, 2019 1 commit
    • Reformat source code with black. · fa84ae26
      Aymeric Augustin authored
      This is the result of:
      
          $ black --line-length 119 examples templates transformers utils hubconf.py setup.py
      
      There's a lot of fairly long lines in the project. As a consequence, I'm
      picking the longest widely accepted line length, 119 characters.
      
      This is also Thomas' preference, because it allows for explicit variable
      names, to make the code easier to understand.
  27. 13 Dec, 2019 3 commits
  28. 26 Sep, 2019 2 commits
  29. 30 Aug, 2019 1 commit