Integrate with Apache Arrow/Plasma in-memory store for large datasets (#995)
Summary:
Datasets with many examples can generate very large indexes in TokenBlockDataset (and possibly elsewhere). When using `--num-workers>0` these indexes are pickled and transferred via a multiprocessing pipe, which is slow and can fail if the index grows beyond 4GB (~0.5B examples). Apache Arrow has an in-memory store called Plasma that will offload these arrays to shared memory, which both reduces duplication of the data and avoids needing to pickle.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/995

Differential Revision: D16697219

Pulled By: myleott

fbshipit-source-id: 1b679ee5b3d2726af54ff418f6159a3671173fb8
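
The idea in the summary can be illustrated with a minimal sketch. This is not the code added in `fairseq/data/plasma_utils.py`, and it does not use Plasma: as a stand-in it uses the standard library's `multiprocessing.shared_memory` (Python 3.8+). The class name `SharedIndex` and its methods are hypothetical. It also assumes one 8-byte integer per index entry, which is what the 4GB / ~0.5B figure suggests. The point it shows is that the pickled state carries only a small handle, while the index itself stays in shared memory.

```python
import struct
from array import array
from multiprocessing import shared_memory

ITEM = struct.calcsize("q")  # 8 bytes per signed 64-bit entry


class SharedIndex:
    """Hypothetical wrapper (not fairseq's API): keeps an int64 index in
    shared memory and pickles only the block name and the length."""

    def __init__(self, values):
        data = array("q", values)
        self._length = len(data)
        nbytes = self._length * ITEM
        # A zero-sized block is not allowed, so reserve at least one byte.
        self._shm = shared_memory.SharedMemory(create=True, size=max(1, nbytes))
        self._shm.buf[:nbytes] = data.tobytes()
        self._owner = True  # only the creating object unlinks the block

    def __getstate__(self):
        # This small dict is all that would travel through the worker pipe.
        return {"name": self._shm.name, "length": self._length}

    def __setstate__(self, state):
        # The receiver attaches to the existing block instead of copying it.
        self._length = state["length"]
        self._shm = shared_memory.SharedMemory(name=state["name"])
        self._owner = False

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if not 0 <= i < self._length:
            raise IndexError(i)
        return struct.unpack_from("q", self._shm.buf, i * ITEM)[0]

    def close(self):
        self._shm.close()
        if self._owner:
            self._shm.unlink()  # free the block once the creator is done
```

Because `__getstate__` returns only the block name and length, `pickle.dumps` on a `SharedIndex` produces a small payload whose size does not grow with the number of entries. In this sketch the creating object must call `close()` last, since it is the one that unlinks the block.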
Files changed:
fairseq/data/plasma_utils.py (new file, mode 100644)