Unverified Commit efae6d9d authored by Ignacio Pickering, committed by GitHub

improve documentation and fix buggy behavior of shuffle() and cache() (#570)

* improve documentation and fix buggy behavior of shuffle() and cache()

* remove warnings
parent 269344b4
@@ -4,8 +4,10 @@
The `torchani.data.load(path)` creates an iterable of raw data,
where species are strings, and coordinates are numpy ndarrays.
You can transform these iterable by using transformations.
To do transformation, just do `it.transformation_name()`.
You can transform this iterable by using transformations.
To do a transformation, call `it.transformation_name()`. This
will return an iterable that may be cached depending on the specific
transformation.
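
For example, a minimal sketch of this pattern (the file name `dataset.h5` is a placeholder, not part of the diff):

.. code-block:: python

    import torchani

    # build the raw-data iterable and apply one transformation to it
    it = torchani.data.load('dataset.h5')
    it = it.shuffle()  # caches the dataset in memory (if needed) and shuffles it
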
Available transformations are listed below:
@@ -31,17 +33,37 @@ Available transformations are listed below:
specified by species_order. By default the function orders by atomic
number if no extra argument is provided, but a specific order may be requested.
- `remove_outliers`
- `shuffle`
- `remove_outliers` removes some outlier energies from the dataset if present.
- `shuffle` shuffles the provided dataset. Note that if the dataset is
not cached (i.e. it lives on disk and not in memory) then this method
will cache it before shuffling. This may take time and memory depending on
the dataset size. This method may be used before splitting into validation/training
sets to shuffle all molecules in the dataset and ensure uniform sampling from
the initial dataset, and it can also be used during training on a cached
dataset of batches to shuffle the batches (a combined usage sketch follows
the `split` example below).
- `cache` caches the result of previous transformations.
- `collate` pad the dataset, convert it to tensor, and stack them
together to get a batch. `collate` uses a default padding dictionary
If the input is already cached this does nothing.
- `collate` creates batches and pads the atoms of all molecules in each batch
with dummy atoms, then converts each batch to tensors. `collate` uses a
default padding dictionary:
``{'species': -1, 'coordinates': 0.0, 'forces': 0.0, 'energies': 0.0}`` for
padding, but a custom padding dictionary can be passed as an optional
parameter, which overrides this default padding.
parameter, which overrides this default padding. Note that this function
returns a generator; it doesn't cache the result in memory.
- `pin_memory` copy the tensor to pinned memory so that later transfer
to cuda could be faster.
- `pin_memory` copies the tensors to pinned (page-locked) memory so that later transfers
to CUDA devices are faster.
You can also use `split` to split the iterable into pieces. Use `split` as:

.. code-block:: python

    it.split(ratio1, ratio2, None)

where a ``None`` at the end indicates that the last piece takes all of the rest.
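
The following sketch (not part of the diff) chains several of the transformations
described above; the file path, batch size and split ratios are hypothetical values
chosen only for illustration:

.. code-block:: python

    import torchani

    batch_size = 256  # hypothetical value

    ds = torchani.data.load('dataset.h5').shuffle()   # placeholder path; shuffle caches first
    training, validation = ds.split(0.8, None)         # 80% training, the rest validation
    # batch and pad each batch with the default padding dictionary, then cache
    # the finished batches in memory; pin_memory() speeds up later GPU transfers
    training = training.collate(batch_size).pin_memory().cache()
    validation = validation.collate(batch_size).pin_memory().cache()
    # a custom padding dictionary can also be passed to collate() to override the default
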
Note that orderings used in :class:`torchani.utils.ChemicalSymbolsToInts` and
:class:`torchani.nn.SpeciesConverter` should be consistent with orderings used
@@ -53,13 +75,6 @@ with hydrogen, nitrogen and bromine always use ['H', 'N', 'Br'] and never ['N',
ordering, mainly due to backwards compatibility and to fully custom atom types,
but doing so is NOT recommended, since it is very error prone.
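
For instance, a minimal sketch of keeping the two helpers consistent, reusing the
['H', 'N', 'Br'] ordering from the paragraph above:

.. code-block:: python

    import torchani

    species_order = ['H', 'N', 'Br']  # define the ordering once and reuse it

    # both objects are built from the same ordering, so they map each chemical
    # symbol to the same integer index
    symbols_to_ints = torchani.utils.ChemicalSymbolsToInts(species_order)
    converter = torchani.nn.SpeciesConverter(species_order)
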
you can also use `split` to split the iterable to pieces. use `split` as:
.. code-block:: python
it.split(ratio1, ratio2, None)
where the None in the end indicate that we want to use all of the the rest
Example:
@@ -237,6 +252,9 @@ class Transformations:
    @staticmethod
    def shuffle(reenterable_iterable):
        if isinstance(reenterable_iterable, list):
            # the dataset is already cached in memory: reuse the existing list
            list_ = reenterable_iterable
        else:
            # not cached yet: materialize the iterable into a list first
            list_ = list(reenterable_iterable)
        del reenterable_iterable
        gc.collect()
@@ -245,6 +263,8 @@
    @staticmethod
    def cache(reenterable_iterable):
        if isinstance(reenterable_iterable, list):
            # already cached: return the same list instead of copying it again
            return reenterable_iterable
        ret = list(reenterable_iterable)
        del reenterable_iterable
        gc.collect()
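
A small illustration (not part of the diff) of what the new `isinstance` check in
`cache` buys: an already-cached list is returned unchanged instead of being copied again:

.. code-block:: python

    batches = [0, 1, 2]                                # stand-in for an already-cached dataset
    assert Transformations.cache(batches) is batches   # no second copy is made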