Unverified Commit efae6d9d authored by Ignacio Pickering, committed by GitHub

improve documentation and fix buggy behavior of shuffle() and cache() (#570)

* improve documentation and fix buggy behavior of shuffle() and cache()

* remove warnings
parent 269344b4
@@ -4,8 +4,10 @@
The `torchani.data.load(path)` creates an iterable of raw data,
where species are strings, and coordinates are numpy ndarrays.
You can transform these iterable by using transformations.
To do transformation, just do `it.transformation_name()`.
You can transform this iterable by using transformations.
To do a transformation, call `it.transformation_name()`. This
will return an iterable that may be cached depending on the specific
transformation.
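
For example, a minimal sketch of this pattern (the file name `dataset.h5` is a placeholder, not part of the diff):

.. code-block:: python

    import torchani

    # build the raw-data iterable and apply one transformation to it
    it = torchani.data.load('dataset.h5')
    it = it.shuffle()  # caches the dataset in memory (if needed) and shuffles it
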
Available transformations are listed below:
@@ -31,17 +33,37 @@ Available transformations are listed below:
specified by species_order. By default the function orders by atomic
number if no extra argument is provided, but a specific order may be requested.
- `remove_outliers`
- `shuffle`
- `remove_outliers` removes some outlier energies from the dataset if present.
- `shuffle` shuffles the provided dataset. Note that if the dataset is
not cached (i.e. it lives on disk and not in memory) then this method
will cache it before shuffling. This may take time and memory depending on
the dataset size. This method may be used before splitting into validation/training
sets to shuffle all molecules in the dataset and ensure uniform sampling from
the initial dataset, and it can also be used during training on a cached
dataset of batches to shuffle the batches (a combined usage sketch follows
the `split` example below).
- `cache` caches the result of previous transformations.
- `collate` pad the dataset, convert it to tensor, and stack them
together to get a batch. `collate` uses a default padding dictionary
If the input is already cached this does nothing.
- `collate` creates batches and pads the atoms of all molecules in each batch
with dummy atoms, then converts each batch to tensors. `collate` uses a
default padding dictionary:
``{'species': -1, 'coordinates': 0.0, 'forces': 0.0, 'energies': 0.0}`` for
padding, but a custom padding dictionary can be passed as an optional
parameter, which overrides this default padding.
parameter, which overrides this default padding. Note that this function
returns a generator; it doesn't cache the result in memory.
- `pin_memory` copy the tensor to pinned memory so that later transfer
to cuda could be faster.
- `pin_memory` copies the tensors to pinned (page-locked) memory so that later transfers
to CUDA devices are faster.
You can also use `split` to split the iterable into pieces. Use `split` as:

.. code-block:: python

    it.split(ratio1, ratio2, None)

where a ``None`` at the end indicates that the last piece takes all of the rest.
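
The following sketch (not part of the diff) chains several of the transformations
described above; the file path, batch size and split ratios are hypothetical values
chosen only for illustration:

.. code-block:: python

    import torchani

    batch_size = 256  # hypothetical value

    ds = torchani.data.load('dataset.h5').shuffle()   # placeholder path; shuffle caches first
    training, validation = ds.split(0.8, None)         # 80% training, the rest validation
    # batch and pad each batch with the default padding dictionary, then cache
    # the finished batches in memory; pin_memory() speeds up later GPU transfers
    training = training.collate(batch_size).pin_memory().cache()
    validation = validation.collate(batch_size).pin_memory().cache()
    # a custom padding dictionary can also be passed to collate() to override the default
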
Note that orderings used in :class:`torchani.utils.ChemicalSymbolsToInts` and
:class:`torchani.nn.SpeciesConverter` should be consistent with orderings used
@@ -53,13 +75,6 @@ with hydrogen, nitrogen and bromine always use ['H', 'N', 'Br'] and never ['N',
ordering, mainly due to backwards compatibility and to fully custom atom types,
but doing so is NOT recommended, since it is very error prone.
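
For instance, a minimal sketch of keeping the two helpers consistent, reusing the
['H', 'N', 'Br'] ordering from the paragraph above:

.. code-block:: python

    import torchani

    species_order = ['H', 'N', 'Br']  # define the ordering once and reuse it

    # both objects are built from the same ordering, so they map each chemical
    # symbol to the same integer index
    symbols_to_ints = torchani.utils.ChemicalSymbolsToInts(species_order)
    converter = torchani.nn.SpeciesConverter(species_order)
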
you can also use `split` to split the iterable to pieces. use `split` as:
.. code-block:: python
it.split(ratio1, ratio2, None)
where the None in the end indicate that we want to use all of the the rest
Example:
@@ -237,6 +252,9 @@ class Transformations:
    @staticmethod
    def shuffle(reenterable_iterable):
        if isinstance(reenterable_iterable, list):
            # the dataset is already cached in memory: reuse the existing list
            list_ = reenterable_iterable
        else:
            # not cached yet: materialize the iterable into a list first
            list_ = list(reenterable_iterable)
        del reenterable_iterable
        gc.collect()
@@ -245,6 +263,8 @@
    @staticmethod
    def cache(reenterable_iterable):
        if isinstance(reenterable_iterable, list):
            # already cached: return the same list instead of copying it again
            return reenterable_iterable
        ret = list(reenterable_iterable)
        del reenterable_iterable
        gc.collect()
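
A small illustration (not part of the diff) of what the new `isinstance` check in
`cache` buys: an already-cached list is returned unchanged instead of being copied again:

.. code-block:: python

    batches = [0, 1, 2]                                # stand-in for an already-cached dataset
    assert Transformations.cache(batches) is batches   # no second copy is made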