1. 13 Apr, 2020 2 commits
  2. 11 Apr, 2020 2 commits
  3. 10 Apr, 2020 7 commits
      Update tokenizers to 0.7.0-rc5 (#3705) · b7cf9f43
      Anthony MOI authored
      Add `run_glue_tpu.py` that trains models on TPUs (#3702) · 551b4505
      Jin Young Sohn authored
      * Initial commit to get BERT + run_glue.py on TPU
      
      * Add README section for TPU and address comments.
      
      * Cleanup TPU bits from run_glue.py (#3)
      
      TPU runner is currently implemented in:
      https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.
      
      We plan to upstream this directly into `huggingface/transformers`
      (either `master` or `tpu`) branch once it's been more thoroughly tested.
      
      * No need to call `xm.mark_step()` explicitly (#4)
      
      For gradient accumulation we accumulate over batches from a
      `ParallelLoader` instance, which already marks the step itself on `next()`.
      
      * Resolve R/W conflicts from multiprocessing (#5)
      
      * Add XLNet in list of models for `run_glue_tpu.py` (#6)
      
      * Add RoBERTa to list of models in TPU GLUE (#7)
      
      * Add RoBERTa and DistilBert to list of models in TPU GLUE (#8)
      
      * Use barriers to reduce duplicate work/resources (#9)
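The barrier idea can be sketched with stdlib threading (the real code would use an XLA rendezvous; the names and data here are purely illustrative): one rank does the expensive one-time work, and the others wait at the barrier instead of repeating it.

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
cache = {}

def worker(rank, results):
    # Only rank 0 does the expensive one-time work (stand-in for e.g.
    # building the feature cache); everyone else waits at the barrier.
    if rank == 0:
        cache["features"] = [i * i for i in range(10)]
    barrier.wait()  # all ranks meet here; the cache now exists for everyone
    results[rank] = cache["features"][-1]

results = {}
threads = [threading.Thread(target=worker, args=(r, results))
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every rank reads the same cached value instead of recomputing it
assert all(v == 81 for v in results.values())
```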
      
      * Shard eval dataset and aggregate eval metrics (#10)
      
      * Shard eval dataset and aggregate eval metrics
      
      Also, instead of calling `eval_loss.item()` on every step, do the
      summation with tensors on device.
      
      * Change defaultdict to float
      
      * Reduce the pred, label tensors instead of metrics
      
      As brought up during review, some metrics like F1 cannot be aggregated
      by averaging. Which metric a GLUE task uses depends largely on the
      dataset, so instead we sync the prediction and label tensors so that
      the metrics can be computed accurately on those.
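A tiny pure-Python example (toy data, not the actual GLUE pipeline) shows why averaging per-shard F1 scores is wrong, while computing F1 once over the gathered predictions and labels is not:

```python
def f1(preds, labels):
    # Standard binary F1 from true positives / false positives / false negatives.
    tp = sum(p == l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two eval shards, as if the dataset were split across TPU cores.
shard1 = dict(preds=[0, 0, 0, 0], labels=[1, 0, 0, 0])
shard2 = dict(preds=[1, 1, 1, 0], labels=[1, 1, 1, 0])

# Wrong: average the per-shard F1 scores.
avg_f1 = (f1(**shard1) + f1(**shard2)) / 2  # (0.0 + 1.0) / 2 = 0.5

# Right: gather predictions and labels, then compute F1 once.
global_f1 = f1(shard1["preds"] + shard2["preds"],
               shard1["labels"] + shard2["labels"])  # 6/7

assert avg_f1 != global_f1
```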
      
      * Only use tb_writer from master (#11)
      
      * Apply huggingface black code formatting
      
      * Style
      
      * Remove `--do_lower_case` as example uses cased
      
      * Add option to specify tensorboard logdir
      
      This is needed for our testing framework, which checks regressions
      against key metrics written by the summary writer.
      
      * Using configuration for `xla_device`
      
      * Prefix TPU specific comments.
      
      * num_cores clarification and namespace eval metrics
      
      * Cache features file under `args.cache_dir`
      
      Instead of under `args.data_dir`. This is needed as our test infra uses
      data_dir with a read-only filesystem.
      
      * Rename `run_glue_tpu` to `run_tpu_glue`
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
      [examples] Generate argparsers from type hints on dataclasses (#3669) · b169ac9c
      Julien Chaumond authored
      * [examples] Generate argparsers from type hints on dataclasses
      
      * [HfArgumentParser] way simpler API
      
      * Restore run_language_modeling.py for easier diff
      
      * [HfArgumentParser] final tweaks from code review
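The idea — derive an argparse parser from a dataclass's type hints and defaults — can be sketched as follows. This is a simplified stand-in, not the actual `HfArgumentParser` API, and the dataclass fields are hypothetical:

```python
import argparse
from dataclasses import MISSING, dataclass, fields

@dataclass
class ExampleArguments:
    # Hypothetical fields for illustration; not the real TrainingArguments.
    model_name: str              # no default -> required CLI flag
    learning_rate: float = 5e-5
    num_epochs: int = 3

def parser_from_dataclass(cls):
    """Build an argparse parser from a dataclass's type hints and defaults."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        if f.default is MISSING:
            parser.add_argument(f"--{f.name}", type=f.type, required=True)
        else:
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser

ns = parser_from_dataclass(ExampleArguments).parse_args(
    ["--model_name", "bert-base-cased", "--learning_rate", "3e-5"]
)
args = ExampleArguments(**vars(ns))
assert args.learning_rate == 3e-5 and args.num_epochs == 3
```

Each field's annotation doubles as the `type=` converter, and the presence of a default decides whether the flag is required.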
      Multilingual BART (#3602) · 7a7fdf71
      Sam Shleifer authored
      - support mbart-en-ro weights
      - add MBartTokenizer
      Big cleanup of `glue_convert_examples_to_features` (#3688) · f98d0ef2
      Julien Chaumond authored
      * Big cleanup of `glue_convert_examples_to_features`
      
      * Use batch_encode_plus
      
      * Cleaner wrapping of glue_convert_examples_to_features for TF
      
      @lysandrejik
      
      * Cleanup syntax, thanks to @mfuntowicz
      
      * Raise explicit error in case of user error
  4. 09 Apr, 2020 5 commits
  5. 08 Apr, 2020 6 commits
  6. 07 Apr, 2020 8 commits
  7. 06 Apr, 2020 10 commits
    • Teven authored · 0a9d09b4
      Tokenizers v3.0.0 (#3185) · 96ab75b8
      Funtowicz Morgan authored
      * Renamed num_added_tokens to num_special_tokens_to_add
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Cherry-pick: partially fix space-only input without special tokens added to the output (#3091)
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Make fast tokenizers unittests work on Windows.
      
      * Entirely refactored unittests for fast tokenizers.
      
      * Remove ABC class for CommonFastTokenizerTest
      
      * Added embeded_special_tokens tests from allenai @dirkgr
      
      * Make embeded_special_tokens tests from allenai more generic
      
      * Uniformize vocab_size as a property for both Fast and normal tokenizers
      
      * Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)
      
      * Ensure providing None input raises the same ValueError as the Python tokenizer, + tests.
      
      * Fix invalid input for assert_padding when testing batch_encode_plus
      
      * Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
      
      * Ensure tokenize() correctly forwards add_special_tokens to Rust.
      
      * Add None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
      Avoid stripping on None values.
      
      * unittests ensure tokenize() also throws a ValueError if provided None
      
      * Added add_special_tokens unittest for all supported models.
      
      * Style
      
      * Make sure TransfoXL tests run only if PyTorch is provided.
      
      * Split up tokenizers tests for each model type.
      
      * Fix invalid unittest with new tokenizers API.
      
      * Filter out Roberta openai detector models from unittests.
      
      * Introduce BatchEncoding on fast tokenizers path.
      
      This new structure exposes all the mappings retrieved from Rust.
      It also keeps the current behavior with model forward.
      
      * Introduce BatchEncoding on slow tokenizers path.
      
      Backward compatibility.
      
      * Improve error message on BatchEncoding for slow path
      
      * Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.
      
      * Style and format.
      
      * Added typing on all methods for PretrainedTokenizerFast
      
      * Style and format
      
      * Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.
      
      * Style and format
      
      * encode_plus now supports pretokenized inputs.
      
      * Remove user warning about add_special_tokens when working on pretokenized inputs.
      
      * Always go through the post processor.
      
      * Added support for pretokenized input pairs on encode_plus
      
      * Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.
      
      * Added pretokenized inputs support on batch_encode_plus
      
      * Update BatchEncoding methods name to match Encoding.
      
      * Bump setup.py tokenizers dependency to 0.7.0rc1
      
      * Remove unused parameters in BertTokenizerFast
      
      * Make sure Roberta returns token_type_ids for unittests.
      
      * Added missing typings
      
      * Update add_tokens prototype to match tokenizers side and allow AddedToken
      
      * Bumping tokenizers to 0.7.0rc2
      
      * Added documentation for BatchEncoding
      
      * Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.
      
      * Added higher-level typing for tokenize / encode_plus / batch_encode_plus.
      
      * Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.
      
      * Fix text-classification pipeline using the wrong tokenizer
      
      * Make pipelines work with BatchEncoding
      
      * Turn off add_special_tokens on tokenize by default.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove add_prefix_space from tokenize call in unittest.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style and quality
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Correct message for batch_encode_plus none input exception.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Fix invalid list comprehension for offset_mapping overriding content every iteration.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * TransfoXL uses Strip normalizer.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Bump tokenizers dependency to 0.7.0rc3
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Support AddedTokens for special_tokens and use left stripping on mask for Roberta.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * SpecialTokensMixin can use slots for faster access to underlying attributes.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Remove update_special_tokens from fast tokenizers.
      
      * Ensure TransfoXL unittests are run only when torch is available.
      
      * Style.
      Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
      
      * Style
      
      * Style 🙏🙏
      
      * Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.
      
      * Remove Roberta warning on __init__.
      
      * Move documentation to Google style.
      Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
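The BatchEncoding idea — a dict-like object that models can consume unchanged while the extra mappings from the Rust tokenizer stay accessible — can be sketched like this (a simplified stand-in, not the real class; the ids and offsets are illustrative):

```python
class BatchEncoding(dict):
    """Dict-like container: behaves like the plain dict models expect
    (so `model(**batch)` keeps working), while also carrying the extra
    alignment info that fast tokenizers return."""

    def __init__(self, data, offsets=None):
        super().__init__(data)
        self._offsets = offsets or []

    def token_to_chars(self, i):
        # Map token index i back to its (start, end) character span.
        return self._offsets[i]

enc = BatchEncoding(
    {"input_ids": [101, 7592, 2088, 102], "attention_mask": [1, 1, 1, 1]},
    offsets=[(0, 0), (0, 5), (6, 11), (0, 0)],
)
assert enc["input_ids"][0] == 101        # still plain dict access
assert enc.token_to_chars(2) == (6, 11)  # plus the offset mapping
```

Because it subclasses dict, existing code that unpacks the encoding into a model forward keeps working unmodified, which is the backward-compatibility point the commit makes.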
      Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631) · e52d1258
      Ethan Perez authored
      * Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py
      
      `convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it might be helpful if someone more familiar with this part of the codebase checked.
      
      * Simplifying change to match recent commits
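A minimal sketch of the bug: the pad ids come from the commit message above, but the padding helper and token ids are hypothetical, not the library's API.

```python
# Pad token ids differ per model (values from the commit message):
PAD_ID = {"bert": 0, "roberta": 1, "xlnet": 5}

def pad_batch(sequences, pad_token_id, max_len):
    """Right-pad each sequence to max_len with the given pad id."""
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 2023, 102], [101, 102]]  # toy token ids

# Wrong for RoBERTa: hard-coding pad_token=0 (BERT's pad id)
wrong = pad_batch(batch, 0, 4)
# Right: look up the model's actual pad id
right = pad_batch(batch, PAD_ID["roberta"], 4)

assert wrong[1] == [101, 102, 0, 0]
assert right[1] == [101, 102, 1, 1]
```

With the wrong pad id, padding positions look like real tokens to the model (and real tokens with id 0/1 look like padding), which silently skews attention masks and predictions.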
    • Create README.md · 0ac33ddd
      ktrapeznikov authored
    • Add model card · 326e6eba
      Manuel Romero authored
    • Add model card · 43eca3f8
      Manuel Romero authored
    • Create README.md · 6bec88ca
      Manuel Romero authored
    • Add model card (#3655) · 769b60f9
      Manuel Romero authored
      * Add model card
      
      * Fix model name in fine-tuning script
    • Create model card (#3654) · c4bcb019
      Manuel Romero authored
      * Create model card
      
      * Fix model name in fine-tuning script
    • Create README.md · 6903a987
      Manuel Romero authored