1. 15 Feb, 2022 2 commits
  2. 14 Feb, 2022 1 commit
    • Register feature extractor (#15634) · 2e11a043
      Sylvain Gugger authored
      * Rework AutoFeatureExtractor.from_pretrained internal
      
      * Custom feature extractor
      
      * Add more tests
      
      * Add support for custom feature extractor code
      
      * Clean up
      
      * Add register API to AutoFeatureExtractor
  3. 11 Feb, 2022 4 commits
  4. 10 Feb, 2022 4 commits
  5. 09 Feb, 2022 8 commits
  6. 08 Feb, 2022 2 commits
  7. 07 Feb, 2022 6 commits
  8. 04 Feb, 2022 2 commits
  9. 03 Feb, 2022 3 commits
  10. 02 Feb, 2022 7 commits
    • Correct eos_token_id settings in generate (#15403) · 5ec368d7
      CHI LIU authored
      * Correct eos_token_id set in generate
      
      * Set eos_token_id in test
      
      * Correct eos_token_id set in generate
      
      * Set eos_token_id in test
    • fix set truncation attribute in `__init__` of `PreTrainedTokenizerBase` (#15456) · 39b5d1a6
      SaulLu authored
      
      
      * change truncation_side in init of `PreTrainedTokenizerBase`
      Co-authored-by: LSinev <LSinev@users.noreply.github.com>
      
      * add test
      
      * Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"
      
      This reverts commit 7a98b87962d2635c7e4d4f00db3948b694624843.
      
      * fix kwargs
      
      * Revert "fix kwargs"
      
      This reverts commit 67b0a5270e8cf1dbf70e6b0232e94c0452b6946f.
      
      * Update tests/test_tokenization_common.py
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * delete truncation_side variable
      
      * reorganize test
      
      * format
      
      * complete doc
      
      * Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`""
      
      This reverts commit d5a10a7e2680539e5d9e98ae5d896c893d224b80.
      
      * fix typo
      
      * fix typos to render documentation
      
      * Revert "Revert "Revert "replace assert with exception for `padding_side` arg in `PreTrainedTokenizerBase` `__init__`"""
      
      This reverts commit 16cf58811943a08f43409a7c83eaa330686591d0.
      
      * format
      Co-authored-by: LSinev <LSinev@users.noreply.github.com>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
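As an illustration of what the `truncation_side` attribute touched by #15456 controls, here is a minimal sketch in plain Python. The `truncate` helper and its signature are hypothetical and are not the tokenizer's implementation; the sketch only assumes that `"right"` drops tokens from the end of the sequence and `"left"` drops them from the start.

```python
from typing import List, Sequence


def truncate(ids: Sequence[int], max_length: int, truncation_side: str = "right") -> List[int]:
    """Hypothetical helper: cut `ids` down to `max_length` items from one side."""
    if truncation_side not in ("left", "right"):
        raise ValueError(f"truncation_side must be 'left' or 'right', got {truncation_side!r}")
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(ids) <= max_length:
        return list(ids)
    if truncation_side == "right":
        # Keep the beginning, drop the overflow at the end.
        return list(ids[:max_length])
    # Keep the end, drop the overflow at the beginning.
    # Slicing from len(ids) - max_length avoids the ids[-0:] pitfall when max_length is 0.
    return list(ids[len(ids) - max_length:])
```

With `ids = [1, 2, 3, 4, 5]` and `max_length=3`, the right-side setting keeps the first three items and the left-side setting keeps the last three.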
    • Add W&B backend for hyperparameter sweep (#14582) · c74f3d4c
      Ayush Chaurasia authored
      # Add support for W&B hyperparameter sweep
      This PR:
      * allows using wandb to run a hyperparameter search.
      * visualizes the runs on the W&B sweeps dashboard.
      * supports running sweeps on parallel devices, all reporting to the same central dashboard.
      
      ### Usage
      **To run a new hyperparameter search:**
      ```
      # assumes `trainer` is an existing `Trainer` instance
      trainer.hyperparameter_search(
          backend="wandb",
          project="transformers_sweep",  # name of the project
          n_trials=5,
          metric="eval/loss",  # metric to be optimized, default 'eval/loss'. A warning is raised if the passed metric is not found
      )
      ```
      This outputs a sweep id, e.g. `my_project/sweep_id`.
      
      **To run sweeps on parallel devices:**
      Just pass the sweep id that you want to run in parallel:
      ```
      trainer.hyperparameter_search(
          backend="wandb",
          sweep_id="my_project/sweep_id",
      )
      ```
    • Save code of registered custom models (#15379) · 44b21f11
      Sylvain Gugger authored
      
      
      * Allow dynamic modules to use relative imports
      
      * Work for configs
      
      * Fix last merge conflict
      
      * Save code of registered custom objects
      
      * Map strings to strings
      
      * Fix test
      
      * Add tokenizer
      
      * Rework tests
      
      * Tests
      
      * Ignore fixtures py files for tests
      
      * Tokenizer test + fix collection
      
      * With full path
      
      * Rework integration
      
      * Fix typo
      
      * Remove changes in conftest
      
      * Test for tokenizers
      
      * Add documentation
      
      * Update docs/source/custom_models.mdx
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      
      * Add file structure and file content
      
      * Add more doc
      
      * Style
      
      * Update docs/source/custom_models.mdx
      Co-authored-by: Suraj Patil <surajp815@gmail.com>
      
      * Address review comments
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      Co-authored-by: Suraj Patil <surajp815@gmail.com>
    • Adding support for `microphone` streaming within pipeline. (#15046) · 623d8cb4
      Nicolas Patry authored
      
      
      * Adding support for `microphone` streaming within pipeline.
      
      - Uses `ffmpeg` to get microphone data.
      - Makes sure alignment is made to `size_of_sample`.
      - Works by sending `{"raw": ..data.., "stride": (n, left, right),
      "partial": bool}`
      directly to the pipeline, which makes it possible to stream partial
      results and still get inference.
      - Lets `partial` information flow through the pipeline so the caller
        can get it back and choose whether to display the text.
      
      - The striding reconstitution is bound to have errors since CTC does not
      keep previous state. Currently most of the errors come from not knowing
      whether there is a space between two chunks.
      Since we have some left striding info, we could use that during decoding
      to choose what to do with those spaces and maybe even extra letters (if
      the stride is long enough, it's bound to cover at least a few symbols).
      
      Fixing tests.
      
      Protecting with `require_torch`.
      
      `raw_ctc` support for nicer demo.
      
      Post rebase fixes.
      
      Revamp to split raw_mic_data from its live chunking.
      
      - Requires a refactor to make everything a bit cleaner.
      
      Automatic resampling.
      
      Small fix.
      
      Small fix.
      
      * Post rebase fix (need to let super handle more logic, reorder args.)
      
      * Update docstrings
      
      * Docstring format.
      
      * Remove print.
      
      * Prevent flow of `input_values`.
      
      * Fixing `stride` too.
      
      * Fixing the PR by removing `raw_ctc`.
      
      * Better docstrings.
      
      * Fixing init.
      
      * Update src/transformers/pipelines/audio_utils.py
      Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
      
      * Update tests/test_pipelines_automatic_speech_recognition.py
      Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
      
      * Quality.
      Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
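The chunk format described in #15046 can be sketched in plain Python as follows. This is a simplified illustration, not the code in `audio_utils.py`: the `chunk_with_stride` name, its parameters, the reading of `n` as the chunk length in bytes, and the default `sample_size` of 2 bytes are all assumptions, and it works on a complete buffer rather than a live `ffmpeg` stream.

```python
from typing import Dict, Iterator


def chunk_with_stride(
    data: bytes,
    chunk_len: int,
    stride_left: int,
    stride_right: int,
    sample_size: int = 2,  # assumption: 16-bit samples
) -> Iterator[Dict]:
    """Hypothetical helper: yield overlapping chunks tagged with (n, left, right)."""
    for value in (chunk_len, stride_left, stride_right):
        if value % sample_size != 0:
            raise ValueError("chunk_len and strides must be multiples of sample_size")
    if stride_left + stride_right >= chunk_len:
        raise ValueError("stride_left + stride_right must be smaller than chunk_len")

    step = chunk_len - stride_left - stride_right  # new bytes contributed per chunk
    start = 0
    left = 0  # the first chunk has no left context
    while start + chunk_len < len(data):
        raw = data[start:start + chunk_len]
        # A live stream would mark incomplete chunks with "partial": True.
        yield {"raw": raw, "stride": (len(raw), left, stride_right), "partial": False}
        left = stride_left
        start += step

    tail = data[start:]
    if tail:
        # The last chunk has no right context.
        yield {"raw": tail, "stride": (len(tail), left, 0), "partial": False}
```

Dropping the `left` and `right` bytes of every chunk and concatenating what remains gives back the original buffer, which is why the stride information has to travel with each chunk.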
    • Patrick von Platen's avatar
    • Add option to resize like torchvision's Resize (#15419) · 1d94d575
      NielsRogge authored
      * Add torchvision's resize
      
      * Rename torch_resize to default_to_square
      
      * Apply suggestions from code review
      
      * Add support for default_to_square and tuple of length 1
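A rough sketch of the size computation that `default_to_square` in #15419 switches between, written in plain Python. The `output_size` helper is hypothetical, and it assumes sizes are `(width, height)` pairs and that the non-square branch matches the shorter edge to `size` while keeping the aspect ratio, as torchvision's `Resize` does for an integer size.

```python
from typing import Sequence, Tuple, Union


def output_size(
    width: int,
    height: int,
    size: Union[int, Sequence[int]],
    default_to_square: bool = True,
) -> Tuple[int, int]:
    """Hypothetical helper: return the (width, height) an image would be resized to."""
    if isinstance(size, (tuple, list)):
        if len(size) == 2:
            # An explicit pair is used as-is (assumed to be (width, height)).
            return (size[0], size[1])
        if len(size) != 1:
            raise ValueError("size must be an int or a sequence of length 1 or 2")
        size = size[0]  # a sequence of length 1 behaves like an int

    if default_to_square:
        return (size, size)

    # Match the shorter edge to `size` and scale the longer edge proportionally.
    short, long = (width, height) if width <= height else (height, width)
    new_short, new_long = size, int(size * long / short)
    return (new_short, new_long) if width <= height else (new_long, new_short)
```

For a 640x480 image and `size=240`, the square default gives `(240, 240)`, while `default_to_square=False` gives `(320, 240)`.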
  11. 01 Feb, 2022 1 commit
    • fix the `tokenizer_config.json` file for the slow tokenizer when a fast version is available (#15319) · 7b8bdd86
      SaulLu authored
      
      * add new test
      
      * update test
      
      * remove `tokenizer_file` from `additional_files_names` in `tokenization_utils_base.py`
      
      * add `tokenizer_file` for the fast only tokenizer
      
      * change global variables layoutxlm
      
      * remove `"tokenizer_file"` from DPR tokenizer's Global variables
      
      * remove `tokenizer_file` from herbert slow tokenizer init
      
      * `"tokenizer_file"` from LED tokenizer's Global variables
      
      * remove `tokenizer_file` from mbart slow tokenizer init
      
      * remove `tokenizer_file` from slow tokenizer template
      
      * adapt to versioning
      
      * adapt the `test_tokenizer_mismatch_warning` test
      
      * clean test
      
      * clarify `VOCAB_FILES_NAMES` in tokenization_utils_fast.py
      
      * Revert "remove `tokenizer_file` from mbart slow tokenizer init"
      
      This reverts commit 0dbb723fa9c7599d4640fe30b3647a74eb4a64e1.
      
      * Revert "`"tokenizer_file"` from LED tokenizer's Global variables"
      
      This reverts commit 5a3f879bdd651233f3d74a3d1146c34cde82b0c2.
      
      * Revert "remove `tokenizer_file` from herbert slow tokenizer init"
      
      This reverts commit f5e10007b7b0ec5345e015b9de7ffec72c5407fd.
      
      * Revert "remove `"tokenizer_file"` from DPR tokenizer's Global variables"
      
      This reverts commit da0895330bedfafc81ae3073470a9348c669f032.
      
      * set `tokenizer_file` in super `__init__` of mbart