Core usage is through the `DLataset` class, defined in `dlimp/dlimp/dataset.py`. It is a thin wrapper around `tf.data.Dataset` designed for working with datasets of trajectories; it has two creation methods, `from_tfrecords` and `from_rlds`. This library additionally provides a suite of *frame-level* and *trajectory-level* transforms designed to be used with `DLataset.frame_map` and `DLataset.traj_map`, respectively.
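To make the distinction concrete, here is a toy sketch of the two map levels in plain Python (this is *not* the real `DLataset` API, which wraps `tf.data.Dataset`; a trajectory is modeled as a dict of equal-length lists purely for illustration). A trajectory-level transform sees a whole trajectory at once, while a frame-level transform is applied independently to each frame:

```python
# Toy illustration of trajectory-level vs. frame-level transforms.
# A "trajectory" here is just a dict of equal-length lists; the real
# DLataset operates on tf.data.Dataset elements instead.

def traj_map(trajectories, fn):
    """Apply fn to each whole trajectory (fn sees all frames at once)."""
    return [fn(traj) for traj in trajectories]

def frame_map(trajectories, fn):
    """Apply fn independently to every frame of every trajectory."""
    out = []
    for traj in trajectories:
        n = len(next(iter(traj.values())))
        frames = [{k: v[i] for k, v in traj.items()} for i in range(n)]
        frames = [fn(f) for f in frames]
        out.append({k: [f[k] for f in frames] for k in traj})
    return out

# Example: add a trajectory-wide "done" flag at the trajectory level,
# then scale observations at the frame level.
trajs = [{"obs": [1, 2, 3], "action": [0, 1, 0]}]
trajs = traj_map(trajs, lambda t: {**t, "done": [False, False, True]})
trajs = frame_map(trajs, lambda f: {**f, "obs": f["obs"] * 10})
```

The real `DLataset.traj_map` and `DLataset.frame_map` follow the same split: use the former for transforms that need the whole trajectory (e.g., computing returns), and the latter for per-frame work (e.g., image decoding).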
Scripts for converting various datasets to the dlimp TFRecord format (compatible with `DLataset.from_tfrecords`) can be found in `legacy_converters/`. This format should be considered deprecated in favor of RLDS; RLDS converters can be found in `rlds_converters/`, and new converters will be added there going forward.
The Kinetics dataset is a collection of short (roughly 10-second) YouTube clips with associated action labels.
It comes in three variants with 400, 600, and 700 classes, and is available at https://deepmind.com/research/open-source/kinetics.
For seamless integration with DLIMP, follow the instructions at https://github.com/cvdfoundation/kinetics-dataset to download the Kinetics variant of your choosing.
Then, to preprocess it for DLIMP, run the preprocessing script with `num` set to 400, 600, or 700:
Contains converters to the [RLDS format](https://github.com/google-research/rlds), which is a specification on top of the [TFDS](https://www.tensorflow.org/datasets) (TensorFlow Datasets) format, which is in turn largely built on top of the TFRecord format. RLDS datasets can be loaded using `dlimp.DLataset.from_rlds`.
Out of the box, TFDS only supports single-threaded dataset conversion and distributed dataset conversion using Apache Beam. `dataset_builder.py` contains a more middle-ground implementation that uses Python multiprocessing to parallelize conversion on a single machine. It is based heavily on Karl Pertsch's implementation (see [kpertsch/bridge_rlds_builder](https://github.com/kpertsch/bridge_rlds_builder/blob/f0d16c5a8384c1476aa1c274a9aef3a5f76cbada/bridge_dataset/conversion_utils.py)).
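The chunked-multiprocessing pattern can be sketched as follows (this is a hypothetical worker setup, not the actual `MultiThreadedDatasetBuilder` implementation; `convert_episode` and the constants are placeholders standing in for real conversion logic and the builder's worker/chunk settings):

```python
# Sketch of chunked parallel conversion (not the actual
# MultiThreadedDatasetBuilder code): episodes are split into chunks,
# a pool of worker processes converts each chunk, and the main
# process collects results serially.
import multiprocessing as mp

NUM_WORKERS = 4   # placeholder, analogous to a worker-count setting
CHUNKSIZE = 8     # placeholder, analogous to a chunk-size setting

def convert_episode(episode_path):
    # Placeholder for real conversion work (decoding, serializing, etc.).
    return {"path": episode_path}

def convert_chunk(paths):
    # Each worker process converts one chunk of episodes.
    return [convert_episode(p) for p in paths]

def convert_all(paths):
    chunks = [paths[i:i + CHUNKSIZE] for i in range(0, len(paths), CHUNKSIZE)]
    ctx = mp.get_context("fork")  # fork so workers inherit this module
    with ctx.Pool(NUM_WORKERS) as pool:
        results = []
        for chunk_result in pool.imap(convert_chunk, chunks):
            # Only the main process aggregates/writes output.
            results.extend(chunk_result)
    return results

records = convert_all([f"episode_{i:03d}" for i in range(20)])
```

This sits between TFDS's two built-in modes: unlike single-threaded conversion it uses all local cores, but unlike Apache Beam it requires no distributed infrastructure.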
## Usage
First, install the multithreaded dataset builder by running `pip install .` in this directory. Each subdirectory contains a specific dataset converter implementation that inherits from the `dataset_builder.MultiThreadedDatasetBuilder` class; a converter may have additional requirements, which it specifies in a `requirements.txt`.
To build a particular dataset, `cd` into its corresponding directory and run `CUDA_VISIBLE_DEVICES="" tfds build --manual_dir <path_to_raw_data>`. See the individual dataset documentation for how to obtain the raw data. You may also want to modify settings inside the `<dataset_name>_dataset_builder.py` file (e.g., `NUM_WORKERS` and `CHUNKSIZE`).