    Adding support for raw python `generator` in addition to `Dataset` for pipelines (#14352) · ed5d1551
    Nicolas Patry authored
    * Adding support for raw python `generator` in addition to `Dataset`
    
    The main goal is to ease the creation of streaming data into the pipeline.
    
    `Dataset` is more involved and PyTorch-specific.
    
    This PR provides a way to use a plain Python iterator too.
    This enables #14250 but can be proposed as a standalone PR.
    
    ```python
    from transformers import pipeline
    
    def read_data(filename):
        with open(filename, 'r') as f:
            for line in f:
                yield line
    
    pipe = pipeline("text-classification")
    for classified in pipe(read_data("large_file.txt")):
        print("Success ! ", classified)
    ```
    
    The main caveat is the interaction with `DataLoader` when
    `num_workers > 1`. With multiple workers, each worker receives a copy
    of the generator (as with `IterableDataset`). That means a naive
    iterator will fail, since every worker iterates over all items of the
    generator.

    There are ways to do clever "skipping", but it can still be costly:
    every worker must still pass through all items of the generator (it
    just ignores the items it doesn't handle), so depending on the case
    it might be a bad trade-off.
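The "skipping" idea above can be sketched in plain Python (the helper name `shard_generator` is ours for illustration, not part of the library): each of `num_workers` copies of the generator keeps only the items whose index matches its worker id. The shards are disjoint, but note that every copy still walks the entire underlying stream.

```python
def shard_generator(gen_fn, worker_id, num_workers):
    # Each worker gets its own copy of the generator and keeps only
    # every num_workers-th item, starting at its worker_id.
    # All workers still consume the full underlying stream.
    for idx, item in enumerate(gen_fn()):
        if idx % num_workers == worker_id:
            yield item

def make_data():
    yield from range(10)

# Two workers split the stream into disjoint halves.
worker0 = list(shard_generator(make_data, 0, 2))
worker1 = list(shard_generator(make_data, 1, 2))
```

This keeps the shards disjoint and complete, but both workers still enumerate all ten items, which is exactly the cost described above.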
    
    Using `num_workers=1` is the simplest fix, and if the cost of loading
    your data is small enough, it should be good enough. In the example
    above, trying smart tricks to skip some lines is unlikely to be a net
    positive.
    
    If there is a cheap way to do "jumps" into the data, then using
    `Dataset` is more advisable (since different workers can then jump
    by themselves).
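A minimal sketch of what such a "jumping" map-style dataset could look like, assuming the data is a line-oriented text file (this is our illustration, not code from the PR): byte offsets of each line are indexed once up front, so `__getitem__` can seek straight to line `i`, and a `DataLoader` worker only touches the indices it is assigned. A real `torch.utils.data.Dataset` subclass would have the same shape.

```python
import tempfile

class LineDataset:
    """Map-style dataset over a text file (illustrative sketch).

    Line byte offsets are indexed once in __init__, so __getitem__
    can seek directly to line i -- the "jump" that lets each
    DataLoader worker fetch only its own indices.
    """

    def __init__(self, filename):
        self.filename = filename
        self.offsets = []
        with open(filename, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        # Random access: seek to the stored offset, read one line.
        with open(self.filename, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode().rstrip("\n")

# Usage: fetch line 2 without reading the preceding lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
ds = LineDataset(tmp.name)
```

Here workers never need to scan past each other's items, which is why `Dataset` composes better with `num_workers > 1` than a generator does.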
    
    * Adding iterator support for `tf` too.