Integrate with Apache Arrow/Plasma in-memory store for large datasets (#995)
Summary: Datasets with many examples can generate very large indexes in `TokenBlockDataset` (and possibly elsewhere). When `--num-workers` is greater than 0, these indexes are pickled and transferred to the worker processes over a multiprocessing pipe, which is slow and can fail once the index grows beyond 4GB (~0.5B examples). Apache Arrow provides an in-memory object store called Plasma that offloads these arrays to shared memory, which both reduces duplication of the data and avoids the need to pickle them.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/995
Differential Revision: D16697219
Pulled By: myleott
fbshipit-source-id: 1b679ee5b3d2726af54ff418f6159a3671173fb8
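The idea in the summary (send workers a small handle instead of pickling the whole index) can be illustrated with a minimal sketch. This is not the Plasma API and not the code added in this commit: it swaps in Python's standard-library `multiprocessing.shared_memory` (Python 3.8+) as a stand-in, and the names `publish_index` and `read_entry` are hypothetical.

```python
import pickle
import struct
from array import array
from multiprocessing import shared_memory

ITEM = struct.calcsize("q")  # bytes per signed 64-bit index entry


def publish_index(values):
    """Copy an integer index into a new shared-memory block (hypothetical helper).

    Returns the owning block and a small (name, length) handle that is cheap
    to pickle, unlike the index itself.
    """
    data = array("q", values)
    nbytes = len(data) * ITEM
    # A zero-size block is not allowed, so always allocate at least one byte.
    shm = shared_memory.SharedMemory(create=True, size=max(1, nbytes))
    shm.buf[:nbytes] = data.tobytes()
    return shm, (shm.name, len(data))


def read_entry(handle, i):
    """Attach to the block named in the handle and read entry i (hypothetical helper)."""
    name, length = handle
    if not 0 <= i < length:
        raise IndexError(i)
    shm = shared_memory.SharedMemory(name=name)
    try:
        return struct.unpack_from("q", shm.buf, i * ITEM)[0]
    finally:
        shm.close()  # detach only; the owner is responsible for unlinking


if __name__ == "__main__":
    block, handle = publish_index(range(1_000_000))
    try:
        wire = pickle.dumps(handle)  # this is all a worker would need to receive
        print(len(wire), "bytes pickled for", handle[1], "entries")
        print(read_entry(pickle.loads(wire), 123_456))
    finally:
        block.close()
        block.unlink()
```

Only the handle crosses the pipe, so the amount pickled no longer depends on the number of examples. A real integration would also need to manage the lifetime of the shared block across worker processes, which this sketch leaves to the caller.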
fairseq/data/plasma_utils.py
0 → 100644