1. 02 Aug, 2022 1 commit
    • Update pipeline word heuristic to work with whitespace in token offsets (#18402) · 042f4203
      David authored
      * Update pipeline word heuristic to work with whitespace in token offsets
      
      This change checks for whitespace in the input string at either the
      character preceding the token or the first character of the token.
      This works with tokenizers that return offsets excluding the
      whitespace between words, or with offsets including it.
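
      A minimal sketch of that check (the function name and shape are
      illustrative, not the actual pipeline code):

      ```python
      def token_starts_word(text: str, start: int) -> bool:
          """True if the token beginning at `start` opens a new word.

          Works whether the tokenizer's offsets exclude or include the
          whitespace between words.
          """
          if start == 0:
              return True
          # Whitespace just before the token, or as its first character.
          return text[start - 1].isspace() or text[start].isspace()
      ```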
      
      fixes #18111
      
      * Use smaller model, ensure expected tokenization
      
      * Re-run CI (please squash)
  2. 11 Jul, 2022 1 commit
  3. 18 Mar, 2022 1 commit
  4. 23 Feb, 2022 1 commit
  5. 08 Dec, 2021 1 commit
  6. 22 Nov, 2021 1 commit
  7. 04 Nov, 2021 1 commit
  8. 29 Oct, 2021 1 commit
  9. 06 Oct, 2021 1 commit
  10. 10 Sep, 2021 1 commit
    • [Large PR] Entire rework of pipelines. (#13308) · c63fcabf
      Nicolas Patry authored
      
      
      * Enabling dataset iteration on pipelines.
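
      A short usage sketch of the iteration this enables (the dataset name
      is illustrative; `KeyDataset` is the helper referenced below):

      ```python
      from datasets import load_dataset
      from transformers import pipeline
      from transformers.pipelines.pt_utils import KeyDataset

      pipe = pipeline("text-classification")
      dataset = load_dataset("imdb", split="test")
      # The pipeline consumes the dataset lazily instead of materializing
      # a full list of inputs in memory.
      for output in pipe(KeyDataset(dataset, "text")):
          print(output)
      ```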
      
      Unifying parameters under the `set_parameters` function.
      
      Small fix.
      
      Last fixes after rebase
      
      Remove print.
      
      Fixing text2text `generate_kwargs`
      
      No more `self.max_length`.
      
      Fixing TF-only conversational.

      Consistency in start/stop index over TF/PT.

      Speeding up drastically on TF (nasty bug where max_length would
      increase a ton).

      Adding test for support for non-fast tokenizers.

      Fixing GPU usage on zero-shot.

      Fix working on TF.
      
      Update src/transformers/pipelines/base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      Update src/transformers/pipelines/base.py
      Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
      
      Small cleanup.
      
      Remove all asserts + simple format.
      
      * Fixing audio-classification for large PR.
      
      * Overly explicit null checking.
      
      * Encapsulating GPU/CPU pytorch manipulation directly within `base.py`.
      
      * Removed internal state for parameters of the pipeline.

      Instead of implicitly overriding internal state, we moved to real
      named arguments on every `preprocess`, `_forward`, and
      `postprocess` function.

      Instead, `_sanitize_parameters` is used to split all kwargs of both
      `__init__` and `__call__` into the three kinds of named parameters.
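
      A simplified sketch of that split (an illustrative subclass, not the
      actual pipeline code):

      ```python
      from transformers import Pipeline

      class MyPipeline(Pipeline):
          def _sanitize_parameters(self, **kwargs):
              # Route every kwarg from __init__ or __call__ to exactly
              # one of the three stages.
              preprocess_kwargs = {}
              postprocess_kwargs = {}
              if "max_length" in kwargs:
                  preprocess_kwargs["max_length"] = kwargs["max_length"]
              if "top_k" in kwargs:
                  postprocess_kwargs["top_k"] = kwargs["top_k"]
              return preprocess_kwargs, {}, postprocess_kwargs

          def preprocess(self, inputs, max_length=None):
              ...

          def _forward(self, model_inputs):
              ...

          def postprocess(self, model_outputs, top_k=None):
              ...
      ```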
      
      * Move import warnings.
      
      * Small fixes.
      
      * Quality.
      
      * Another small fix, using the CI to debug faster.
      
      * Last fixes.
      
      * Last fix.
      
      * Small cleanup of tensor moving.
      
      * is not None.
      
      * Adding a bunch of docs + an iteration test.
      
      * Fixing doc style.
      
      * KeyDataset = None guard.
      
      * Removing the CUDA test for pipelines (was testing).
      
      * Even more simple iteration test.
      
      * Correct import.
      
      * Long day.
      
      * Fixes in docs.
      
      * [WIP] migrating object detection.
      
      * Fixed the target_size bug.
      
      * Fixup.
      
      * Bad variable name.
      
      * Fixing `ensure_on_device` so it respects the original ModelOutput.
  11. 09 Sep, 2021 1 commit
  12. 27 Aug, 2021 1 commit
  13. 26 Jul, 2021 1 commit
    • Better heuristic for token-classification pipeline. (#12611) · a3bd7637
      Nicolas Patry authored
      * Better heuristic for token-classification pipeline.
      
      Relooking at the problem makes things actually much simpler: when we
      look at ids from a tokenizer, we have no way in **general** to
      recover whether some substring is part of a word or not.

      However, within the pipeline, with offsets we still have access to
      the original string, so we can simply check whether the character
      preceding a token (if it exists) is actually a space. This will
      obviously be wrong for tokenizers that contain spaces within tokens,
      or for tokenizers whose offsets include spaces too (I don't think
      there are a lot).

      This heuristic is hopefully fully backward compatible and can still
      handle non-word-based tokenizers.
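
      A minimal sketch of the heuristic under those assumptions (the
      checkpoint and sentence are illustrative; offsets require a fast
      tokenizer):

      ```python
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
      sentence = "Hugging Face is based in New York City"
      encoding = tokenizer(sentence, return_offsets_mapping=True)
      for start, end in encoding["offset_mapping"]:
          if start == end:  # special tokens map to (0, 0)
              continue
          # New word if the token is at position 0 or follows a space.
          is_word_start = start == 0 or sentence[start - 1] == " "
      ```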
      
      * Updating test with real values.
      
      * We still need the older "correct" heuristic to prevent fusing
      punctuation.
      
      * Adding a real warning when important.
  14. 09 Jul, 2021 1 commit
  15. 18 May, 2021 2 commits
    • Fixed: Better names for nlp variables in pipelines' tests and docs. (#11752) · fd3b12e8
      Vyom Pathak authored
      * Fixed: Better names for nlp variables in pipelines' tests and docs.
      
      * Fixed: Better variable names
    • [TokenClassification] Label realignment for subword aggregation (#11680) · b88e0e01
      Nicolas Patry authored
      * [TokenClassification] Label realignment for subword aggregation
      
      Tentative replacement for https://github.com/huggingface/transformers/pull/11622/files

      - Added `AggregationStrategy`.
      - The `ignore_subwords` and `grouped_entities` arguments are now
        fused into `aggregation_strategy`. It makes more sense anyway,
        because `ignore_subwords=True` with `grouped_entities=False` did
        not have any meaning.
      - Added 2 new ways to aggregate: MAX and AVERAGE.
      - AVERAGE requires a bit more information than the others; for now
        this case is slightly specific, and we should keep that in mind
        for future changes.
      - Testing has been modified to reflect the new argument, and to
        check the correct deprecation and the new aggregation_strategy.
      - Put the testing argument and testing results for
        aggregation_strategy close together, so that readers can
        understand what is supposed to happen.
      - `aggregate` is now only tested on a small model, as it does not
        mean anything to test it globally for all models.
      - Previous tests are unchanged in desired output.
      - Added a new test case that better showcases the difference between
        the FIRST, MAX and AVERAGE strategies (see the usage sketch after
        this list).
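
      A brief usage sketch of the fused argument (the checkpoint is
      illustrative):

      ```python
      from transformers import pipeline

      ner = pipeline(
          "token-classification",
          model="dslim/bert-base-NER",  # illustrative checkpoint
          aggregation_strategy="average",  # or "first", "max", "simple"
      )
      print(ner("Hugging Face is based in New York City"))
      ```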
      
      * Wrong framework.
      
      * Addressing three issues.
      
      1- Tags might not follow the B-, I- convention, so any tag should
      work now (assumed as B-TAG)
      2- Fixed an issue with average that led to a substantial code change.
      3- The testing suite was not checking for the "index" key for the
      "none" strategy. This is now fixed.

      The issue is that "O" could not be chosen by the AVERAGE strategy
      because those tokens were filtered out beforehand, so their relative
      scores were not counted in the average. Now filtering on
      ignore_labels will happen at the very end of the pipeline, fixing
      that issue. It's a bit hard to make sure this stays like that,
      because we do not have an end-to-end test for that behavior.
      
      * Formatting.
      
      * Adding formatting to code + cleaner handling of B-, I- tags.
      Co-authored-by: Francesco Rubbo <rubbo.francesco@gmail.com>
      Co-authored-by: elk-cloner <rezakakhki.rk@gmail.com>
      
      * Typo.
      Co-authored-by: Francesco Rubbo <rubbo.francesco@gmail.com>
      Co-authored-by: elk-cloner <rezakakhki.rk@gmail.com>
  16. 15 Apr, 2021 1 commit
  17. 15 Feb, 2021 1 commit
  18. 07 Dec, 2020 1 commit
  19. 30 Nov, 2020 1 commit
    • NerPipeline (TokenClassification) now outputs offsets of words (#8781) · d8fc26e9
      Nicolas Patry authored
      * NerPipeline (TokenClassification) now outputs offsets of words
      
      - Currently the offsets are missing, which forces the user to
      pattern-match the "word" from the input, which is not always
      feasible. For instance, if a sentence contains the same word twice,
      there is no way to know which is which.
      - This PR proposes to fix that by outputting 2 new keys in this
      pipeline's outputs, "start" and "end", which correspond to the
      string offsets of the word. That means that we should always have
      the invariant:
      
      ```python
      # The substring at the offsets is the entity text itself (the
      # "word" key), not the label under "entity_group"/"entity".
      input[entity["start"]: entity["end"]] == entity["word"]
      ```
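
      A short sketch of how the new keys disambiguate repeated words (the
      default model and text are illustrative):

      ```python
      from transformers import pipeline

      ner = pipeline("ner")
      text = "Clara lives in Paris and Clara likes Paris."
      for entity in ner(text):
          # The offsets recover the exact occurrence, even for words that
          # appear twice in the input.
          print(entity["start"], entity["end"],
                text[entity["start"]: entity["end"]])
      ```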
      
      * Fixing doc style
  20. 15 Nov, 2020 1 commit
    • [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073) · f4e04cd2
      Thomas Wolf authored
      
      [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073)
      
      * Fixing roberta for slow-fast tests
      
      * WIP getting equivalence on pipelines
      
      * slow-to-fast equivalence - working on question-answering pipeline
      
      * optional FAISS tests
      
      * Pipeline Q&A
      
      * Move pipeline tests to their own test job again
      
      * update tokenizer to add sequence id methods
      
      * update to tokenizers 0.9.4
      
      * set sentencepiece as optional
      
      * clean up squad
      
      * clean up pipelines to use sequence_ids (see the sketch after this list)
      
      * style/quality
      
      * wording
      
      * Switch to use_fast = True by default
      
      * update tests for use_fast at True by default
      
      * fix rag tokenizer test
      
      * removing protobuf from required dependencies
      
      * fix NER test for use_fast = True by default
      
      * fixing example tests (Q&A examples use slow tokenizers for now)
      
      * protobuf in main deps extras["sentencepiece"] and example deps
      
      * fix protobuf install test
      
      * try to fix seq2seq by switching to slow tokenizers for now
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      
      * Update src/transformers/tokenization_utils_base.py
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
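
      A small sketch of the sequence-id methods mentioned above (the
      checkpoint is illustrative; the method needs a fast tokenizer):

      ```python
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
      encoding = tokenizer("What is a pipeline?", "A pipeline chains the steps.")
      # None for special tokens, 0 for the first sequence (the question),
      # 1 for the second (the context).
      print(encoding.sequence_ids())
      ```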
  21. 10 Nov, 2020 1 commit
  22. 03 Nov, 2020 1 commit
    • [WIP] Ner pipeline grouped_entities fixes (#5970) · 29b536a7
      Ceyda Cinarel authored
      
      
      * Bug fix: NER pipeline shouldn't group separate entities of the same type
      
      * style fix
      
      * [Bug Fix] Shouldn't group entities that are both 'B', even if they
      are the same type
      	(B-type1 B-type1) != (B-type1 I-type1)
      [Bug Fix] Add an option `ignore_subwords` to ignore subsequent
      ##wordpieces in predictions, because some models train on only the
      first token of a word and not on the subsequent wordpieces (the BERT
      NER default), so it makes sense to do the same thing at inference time.
      	The simplest fix is to just group the subwords with the first
      	wordpiece (see the sketch after this block).
      	[TODO] How to handle ignored scores? Just set them to 0 and
      	calculate a zero-invariant mean?
      	[TODO] Handle a different wordpiece_prefix than ##? Possible approaches:
      		get it from the tokenizer? but currently most tokenizers don't
      		have a wordpiece_prefix property
      		have an _is_subword(token) helper
      [Feature add] Added an option to `skip_special_tokens`, because it was
      harder to remove them after grouping.
      [Additional Changes] Remove the B/I prefix on returned grouped_entities
      [Feature Request/TODO] Return indexes?
      [Bug TODO] Can't use fast tokenizer with grouped_entities
      ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')
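
      A hypothetical sketch of that grouping idea (simplified, assuming
      the "##" wordpiece prefix; not the actual pipeline code):

      ```python
      def group_subwords(tokens, labels):
          # Fold "##" continuation pieces into the preceding word,
          # keeping the first wordpiece's label (the BERT NER default).
          words = []
          for token, label in zip(tokens, labels):
              if token.startswith("##") and words:
                  words[-1] = (words[-1][0] + token[2:], words[-1][1])
              else:
                  words.append((token, label))
          return words

      # [("Hu", "B-ORG"), ("##gging", "B-ORG"), ("Face", "I-ORG")]
      # -> [("Hugging", "B-ORG"), ("Face", "I-ORG")]
      print(group_subwords(["Hu", "##gging", "Face"],
                           ["B-ORG", "B-ORG", "I-ORG"]))
      ```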
      
      * use offset_mapping to fix [UNK] token problem
      
      * ignore score for subwords
      
      * modify ner_pipeline test
      
      * modify ner_pipeline test
      
      * modify ner_pipeline test
      
      * ner_pipeline change ignore_subwords default to true
      
      * add ner_pipeline ignore_subword=False test case
      
      * fix offset_mapping index
      
      * fix style again duh
      
      * change is_subword and convert_tokens_to_string logic
      
      * merge tests with new test structure
      
      * change test names
      
      * remove old tests
      
      * ner tests for fast tokenizer
      
      * fast tokenizers have convert_tokens_to_string
      
      * Fix the incorrect merge
      Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
      Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
      Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
  23. 23 Oct, 2020 1 commit