1. 12 Nov, 2021 3 commits
    • [Wav2Vec2 Example] Improve fine-tuning script (#14373) · 55f49c5f
      Patrick von Platen authored
      * improve some stuff
      
      * finish
      
      * correct last
    • fix docs (#14377) · 21546e59
      Suraj Patil authored
    • Adding support for raw python `generator` in addition to `Dataset` for pipelines (#14352) · ed5d1551
      Nicolas Patry authored
      * Adding support for raw python `generator` in addition to `Dataset`
      
      The main goal is to ease the creation of streaming data for the pipeline.
      
      `Dataset` is more involved and PyTorch-specific.
      
      This PR provides a way to use a plain Python iterator too.
      This enables #14250, but can stand as a standalone PR.
      
      ```python
      from transformers import pipeline
      
      def read_data(filename):
          with open(filename, 'r') as f:
              for line in f:
                  yield line
      
      pipe = pipeline("text-classification")
      for classified in pipe(read_data("large_file.txt")):
          print("Success ! ", classified)
      ```
      
      The main caveat is the interaction with `DataLoader` when
      `num_workers>1`. With multiple workers, each receives a copy
      of the generator (as with `IterableDataset`), so the naive iterator
      fails: every worker iterates over all items of the generator.
      
      There are ways to do clever "skipping", but they can still be
      costly: every worker still has to pass through all items of the
      generator (each just ignores the items it does not handle), which,
      depending on the case, can be expensive.
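      The worker-copy problem above can be sketched in plain Python (a hypothetical simulation, not the actual `DataLoader` machinery; `stream` stands in for the user's generator):
      
      ```python
      # Pure-Python sketch (no torch needed) of the worker-copy problem.
      # `stream` stands in for the user's generator, e.g. lines of a large file.
      def stream():
          yield from range(6)
      
      num_workers = 2
      
      # Naive: each worker iterates its own copy of the generator,
      # so every item is processed once per worker.
      naive = [item for _ in range(num_workers) for item in stream()]
      assert len(naive) == 12  # 6 items * 2 workers: duplicated work
      
      # "Clever skipping": worker w keeps only items with index % num_workers == w.
      # The output is correct, but each worker still scans the whole stream.
      skipped = sorted(
          item
          for w in range(num_workers)
          for idx, item in enumerate(stream())
          if idx % num_workers == w
      )
      assert skipped == [0, 1, 2, 3, 4, 5]
      ```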
      
      Using `num_workers=1` is the simplest fix, and if the cost of loading
      your data is small enough, it should work well. In the example above,
      for instance, trying smart tricks to skip some lines is unlikely to
      be a net positive.
      
      If there are better ways to do "jumps" into the data, then using a
      `Dataset` is advised (since different workers can then jump directly
      to their own items).
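      By contrast, an indexable dataset lets each worker jump straight to its own shard; a minimal sketch, using a plain list as a stand-in for a `Dataset`:
      
      ```python
      data = list(range(10))  # stand-in for an indexable Dataset (random access)
      
      num_workers = 2
      
      def worker_shard(worker_id):
          # With random access, a worker jumps directly to its own items
          # instead of scanning the whole stream and skipping.
          return data[worker_id::num_workers]
      
      shards = [worker_shard(w) for w in range(num_workers)]
      assert shards[0] == [0, 2, 4, 6, 8]
      assert shards[1] == [1, 3, 5, 7, 9]
      assert sorted(shards[0] + shards[1]) == data  # full coverage, no duplicates
      ```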
      
      * Adding iterator support for `tf` too.
  2. 11 Nov, 2021 7 commits
  3. 10 Nov, 2021 6 commits
  4. 09 Nov, 2021 10 commits
  5. 08 Nov, 2021 8 commits
  6. 06 Nov, 2021 4 commits
  7. 05 Nov, 2021 2 commits