# How to add new built-in prototype datasets

As the name implies, the datasets are still in a prototype state and thus subject to rapid change. This in turn means
that this document will also change a lot.

If you hit a blocker while adding a dataset, please have a look at another similar dataset to see how it is implemented
there. If you can't resolve it yourself, feel free to send a draft PR in order for us to help you out.

Finally, `from torchvision.prototype import datasets` is implied below.

## Implementation

Before we start with the actual implementation, you should create a module in `torchvision/prototype/datasets/_builtin`
that hints at the dataset you are going to add. For example `caltech.py` for `caltech101` and `caltech256`. In that
module create a class that inherits from `datasets.utils.Dataset` and overwrites at minimum three methods that will be
discussed in detail below:

```python
from typing import Any, Dict, List

from torchdata.datapipes.iter import IterDataPipe
from torchvision.prototype.datasets.utils import Dataset, DatasetInfo, OnlineResource


class MyDataset(Dataset):
    # the three methods below are discussed in detail in the following sections
    def _make_info(self) -> DatasetInfo:
        ...

    def resources(self, config) -> List[OnlineResource]:
        ...

    def _make_datapipe(self, resource_dps, *, config) -> IterDataPipe[Dict[str, Any]]:
        ...
```

### `_make_info(self)`

The `DatasetInfo` carries static information about the dataset. There are two required fields:

- `name`: Name of the dataset. This will be used to load the dataset with `datasets.load(name)`. Should only contain
  lowercase characters.

There are more optional parameters that can be passed:

- `dependencies`: Collection of third-party dependencies that are needed to load the dataset, e.g. `("scipy",)`. Their
  availability will be automatically checked if a user tries to load the dataset. Within the implementation, import
  these packages lazily to avoid missing dependencies at import time.
- `categories`: Sequence of human-readable category names for each label. The index of each category has to match the
  corresponding label returned in the dataset samples.
  [See below](#how-do-i-handle-a-dataset-that-defines-many-categories) for how to handle cases with many categories.
- `valid_options`: Configures valid options that can be passed to the dataset. It should be `Dict[str, Sequence[Any]]`.
  The options are accessible through the `config` namespace in the other two functions. The first value of the sequence
  is taken as the default if the user passes no option to `torchvision.prototype.datasets.load()`.
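
For illustration, a minimal `_make_info()` could look like the sketch below. The category names and the `split` option
are made up; replace them with whatever your dataset actually defines.

```python
def _make_info(self) -> DatasetInfo:
    # method of the dataset class from the skeleton above
    return DatasetInfo(
        "my-dataset",
        categories=["cat", "dog"],  # hypothetical categories
        valid_options=dict(split=("train", "test")),  # "train" is the default
    )
```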

### `resources(self, config)`

Returns `List[datasets.utils.OnlineResource]` of all the files that need to be present locally before the dataset with a
specific `config` can be built. The download will happen automatically.

Currently, the following `OnlineResource`'s are supported:

- `HttpResource`: Used for files that are directly exposed through HTTP(s) and only requires the URL.
- `GDriveResource`: Used for files that are hosted on GDrive and requires the GDrive ID as well as the `file_name`.
- `ManualDownloadResource`: Used for files that are not publicly accessible and requires instructions on how to download
  them manually. If the file does not exist, an error will be raised with the supplied instructions.
- `KaggleDownloadResource`: Used for files that are available on Kaggle. This inherits from `ManualDownloadResource`.

Although optional in general, all resources used in the built-in datasets should include a
[SHA256](https://en.wikipedia.org/wiki/SHA-2) checksum for security. It will be automatically checked after the
download. You can compute the checksum with system utilities, e.g. `sha256sum`, or this snippet:

```python
import hashlib


def sha256sum(path, chunk_size=1024 * 1024):
    checksum = hashlib.sha256()
    with open(path, "rb") as f:
        # read the file in chunks to avoid loading it into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum.update(chunk)
    print(checksum.hexdigest())
```
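
Putting this together, a `resources()` implementation for a dataset that ships as a single HTTP archive might look like
the following sketch. The URL and checksum are placeholders, and it assumes that `HttpResource` is importable alongside
`OnlineResource` and accepts the checksum as a `sha256` argument.

```python
from typing import List

from torchvision.prototype.datasets.utils import HttpResource, OnlineResource


def resources(self, config) -> List[OnlineResource]:
    # method of the dataset class from the skeleton above
    images = HttpResource(
        "https://example.com/my-dataset/images.tar.gz",  # placeholder URL
        sha256="...",  # fill in the real checksum, e.g. computed with sha256sum
    )
    return [images]
```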

### `_make_datapipe(resource_dps, *, config)`

This method is the heart of the dataset, where we transform the raw data into a usable form. A major difference compared
to the current stable datasets is that everything is performed through `IterDataPipe`'s. From the perspective of someone
that is working with them rather than on them, `IterDataPipe`'s behave just as generators, i.e. you can't do anything
with them besides iterating.

Of course, there are some common building blocks that should suffice in 95% of the cases. The most used are:

- `Mapper`: Apply a callable to every item in the datapipe.
- `Filter`: Keep only items that satisfy a condition.
- `Demultiplexer`: Split a datapipe into multiple ones.
- `IterKeyZipper`: Merge two datapipes into one.

All of them can be imported `from torchdata.datapipes.iter`. In addition, use `functools.partial` in case a callable
needs extra arguments. If the provided `IterDataPipe`'s are not sufficient for the use case, it is also not complicated
to add one. See the MNIST or CelebA datasets for example.

`_make_datapipe()` receives `resource_dps`, which is a list of datapipes that has a 1-to-1 correspondence with the
return value of `resources()`. In case of archives with regular suffixes (`.tar`, `.zip`, ...), the datapipe will
contain tuples comprised of the path and the handle for every file in the archive. Otherwise the datapipe will only
contain one such tuple for the file specified by the resource.

Since the datapipes are iterable in nature, some datapipes feature an in-memory buffer, e.g. `IterKeyZipper` and
`Grouper`. There are two issues with that:

1. If not used carefully, this can easily overflow the host memory, since most datasets will not fit in completely.
2. This can lead to unnecessarily long warm-up times when data is buffered that is only needed at runtime.

Thus, all buffered datapipes should be used as early as possible, e.g. zipping two datapipes of file handles rather than
trying to zip already loaded images.

There are two special datapipes that are not used through their class, but through the functions `hint_sharding` and
`hint_shuffling`. As the name implies, they only hint at the part in the datapipe graph where sharding and shuffling should take
place, but are no-ops by default. They can be imported from `torchvision.prototype.datasets.utils._internal` and are
required in each dataset.
Finally, each item in the final datapipe should be a dictionary with `str` keys. There is no standardization of the
names (yet!).
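
To make the pieces above concrete, here is a hypothetical `_make_datapipe()` for a dataset whose single archive contains
one folder per split. The helper names, the archive layout, and the `split` option are made up; the `Filter`/`Mapper`
usage and the sharding/shuffling hints follow the description above.

```python
import functools

from torchdata.datapipes.iter import Filter, Mapper

from torchvision.prototype.datasets.utils._internal import hint_sharding, hint_shuffling


def _is_in_split(data, *, split):
    # each item is a (path, file handle) tuple coming from the archive datapipe
    path, _ = data
    return f"/{split}/" in path


def _prepare_sample(data):
    path, buffer = data
    # the keys are free-form for now, see the note above
    return dict(path=path, image=buffer)


# method of the dataset class from the skeleton above
def _make_datapipe(self, resource_dps, *, config):
    dp = resource_dps[0]  # corresponds to the first resource returned by resources()
    dp = Filter(dp, functools.partial(_is_in_split, split=config.split))
    dp = hint_shuffling(dp)
    dp = hint_sharding(dp)
    return Mapper(dp, _prepare_sample)
```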
## Tests
To test the dataset implementation, you usually don't need to add any tests, but need to provide a mock-up of the data.
This mock-up should resemble the original data as closely as necessary, while containing only a few examples.
To do this, add a new function in [`test/builtin_dataset_mocks.py`](../../../../test/builtin_dataset_mocks.py) with the
same name as you have defined in `_make_info()` (if the name includes hyphens `-`, replace them with underscores `_`)
and decorate it with `@register_mock`:
```py
# this is defined in torchvision/prototype/datasets/_builtin
class MyDataset(Dataset):
    def _make_info(self) -> DatasetInfo:
        return DatasetInfo(
            "my-dataset",
            ...
        )


# this is defined in test/builtin_dataset_mocks.py
@register_mock
def my_dataset(info, root, config):
    ...
```

The function receives three arguments:

- `info`: The return value of `_make_info()`.
- `root`: A [`pathlib.Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) of a folder, in which the data
  needs to be placed.
- `config`: The configuration to generate the data for. This is the same value that `_make_datapipe()` receives.

The function should generate all files that are needed for the current `config`. Each file should be complete, e.g. if
the dataset only has a single archive that contains multiple splits, you need to generate the full archive regardless of
the current `config`. Although this seems odd at first, this is important. Consider the following original data setup:

```
root
├── test
│   ├── test_image0.jpg
│   ...
└── train
    ├── train_image0.jpg
    ...
```

For map-style datasets (like the ones currently in `torchvision.datasets`), one explicitly selects the files they want
to load. For example, something like `(root / split).iterdir()` works fine even if only the specific split folder is
present. With iterable-style datasets though, we get something like `root.iterdir()` from `resource_dps` in
`_make_datapipe()` and need to manually `Filter` it to only keep the files we want. If we only generated the data for
the current `config`, the test would also pass if the dataset implementation is missing the filtering, but it would fail
on the real data.

For datasets that are ported from the old API, we already have some mock data in
[`test/test_datasets.py`](../../../../test/test_datasets.py). You can find the corresponding test case there and have a
look at the `inject_fake_data` function. There are a few differences though:

- `tmp_dir` corresponds to `root`, but is a `str` rather than a
  [`pathlib.Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path). Thus, you often see something like
  `folder = pathlib.Path(tmp_dir)`. This is not needed.
- Although both parameters are called `config`, the value in the new tests is a namespace. Thus, please use `config.foo`
  over `config["foo"]` to enhance readability.
- The data generated by `inject_fake_data` was supposed to be in an extracted state. This is no longer the case for the
  new mock-ups. Thus, you need to use helper functions like `make_zip` or `make_tar` to actually generate the files
  specified in the dataset.
- As explained in the paragraph above, the generated data is often "incomplete" and only valid for the given `config`.
  Make sure you follow the instructions above.
The function should return an integer indicating the number of samples in the dataset for the current `config`.
Preferably, this number should be different for different `config`'s to have more confidence in the dataset
implementation.
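
As a sketch, a mock-up for the folder layout shown above could look like this. The image sizes, file counts, and the
`split` option are made up, and `register_mock` is the decorator described above.

```python
from PIL import Image


@register_mock
def my_dataset(info, root, config):
    # generate both splits, even though `config.split` selects only one (see above)
    num_samples_map = {"train": 3, "test": 2}
    for split, num_samples in num_samples_map.items():
        split_folder = root / split  # `root` is a pathlib.Path
        split_folder.mkdir()
        for idx in range(num_samples):
            Image.new("RGB", (4, 5)).save(split_folder / f"{split}_image{idx}.jpg")

    # if the real dataset ships an archive rather than loose folders, pack the files here,
    # e.g. with the `make_zip` / `make_tar` helpers mentioned above

    return num_samples_map[config.split]
```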
Finally, you can run the tests with `pytest test/test_prototype_builtin_datasets.py -k {name}`.

## FAQ

### How do I start?

Get the skeleton of your dataset class ready with all 3 methods. For `_make_datapipe()`, you can just do
`return resource_dps[0]` to get started. Then import the dataset class in
`torchvision/prototype/datasets/_builtin/__init__.py`: this will automatically register the dataset and it will be
instantiable via `datasets.load("mydataset")`. On a separate script, try something like

```py
from torchvision.prototype import datasets

dataset = datasets.load("mydataset")
for sample in dataset:
    print(sample)  # this is a dict with the raw data
    # Or you can also inspect the sample in a debugger
    break
```

This will give you an idea of what the first datapipe in `resource_dps` contains. You can also do that with
`resource_dps[1]` or `resource_dps[2]` (etc.) if they exist. Then follow the instructions above to manipulate these
datapipes and return the appropriate dictionary format.

### How do I handle a dataset that defines many categories?

As a rule of thumb, `datasets.utils.DatasetInfo(..., categories=)` should only be set directly for ten categories or
fewer. If more categories are needed, you can add a `$NAME.categories` file to the `_builtin` folder in which each line
specifies a category. If `$NAME` matches the name of the dataset (which it definitely should!) it will be automatically
loaded if `categories=` is not set.

In case the categories can be generated from the dataset files, e.g. the dataset follows an image folder approach where
each folder denotes the name of the category, the dataset can overwrite the `_generate_categories` method. It gets
passed the `root` path to the resources, but they have to be manually loaded, e.g.
`self.resources(config)[0].load(root)`. The method should return a sequence of strings representing the category names.
To generate the `$NAME.categories` file, run `python -m torchvision.prototype.datasets.generate_category_files $NAME`.
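
For the image folder case, a `_generate_categories` sketch could look like the following. The folder layout, the shape
of the loaded items, and the availability of a default config on the dataset are all assumptions; adapt them to how your
dataset actually obtains its config and lays out its files.

```python
import pathlib
from typing import List


def _generate_categories(self, root: pathlib.Path) -> List[str]:
    # method of the dataset class from the skeleton above
    config = self.default_config  # assumption: a default config is available on the dataset
    # the resources have to be loaded manually, as described above
    dp = self.resources(config)[0].load(root)
    # assumption: each item is a (path, handle) tuple and the parent folder name is the category
    categories = {pathlib.Path(path).parent.name for path, _ in dp}
    return sorted(categories)
```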

### What if a resource file forms an I/O bottleneck?

In general, we are ok with small performance hits of iterating archives rather than their extracted content. However, if
the performance hit becomes significant, the archives can still be decompressed or extracted. To do this, the
`decompress: bool` and `extract: bool` flags can be used for every `OnlineResource` individually. For more complex
cases, each resource also accepts a `preprocess` callable that gets passed a `pathlib.Path` of the raw file and should
return the `pathlib.Path` of the preprocessed file or folder.