{"name":"Megatron Energon Dataloader Documentation","short_name":"Megatron Energon","icons":[{"src":"/android-chrome-192x192.png","sizes":"192x192","type":"image/png"},{"src":"/android-chrome-512x512.png","sizes":"512x512","type":"image/png"}],"theme_color":"#ED467A","background_color":"#411046","display":"standalone"}
<link rel="apple-touch-icon" sizes="180x180" href="{{ pathto('_static/apple-touch-icon.png', 1) }}">
<link rel="icon" type="image/png" sizes="32x32" href="{{ pathto('_static/favicon-32x32.png', 1) }}">
<link rel="icon" type="image/png" sizes="16x16" href="{{ pathto('_static/favicon-16x16.png', 1) }}">
<link rel="shortcut icon" href="{{ pathto('_static/favicon.ico', 1) }}">
<link rel="manifest" href="{{ pathto('_static/site.webmanifest', 1) }}">
{# Extend the base theme layout #}
{% extends "!layout.html" %}
{# Add custom favicon links to the head section #}
{% block extrahead %}
{{ super() }}
{% include "favicon.html" %}
{% endblock %}
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
(crude-data)=
# Crude Datasets and Auxiliary Data
As explained in [](sample-loading), the user has several options for how energon converts the raw (crude) data inside the tar files into Python sample objects (instances of a `Sample` dataclass) during loading.
When using crude datasets, this conversion happens through so-called "cookers", i.e. user-defined functions registered in the task encoder, as explained below.
In this case, the dataset on disk will specify neither the resulting sample type nor a sample loader for conversion, hence we call it "crude".
All of the conversion will happen in the user's code base.
## Setting Up a Crude Dataset with Cookers
Let's try it.
When you run `energon prepare` to prepare your dataset, you can pick "Crude sample" as the sample type.
If you already have an existing energon-compliant dataset, you can modify it as follows (or create a copy of your `dataset.yaml` specifically for this use case and keep the original).
Original `dataset.yaml`:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: OCRSample
sample_loader: sample_loader.py:sample_loader
part_filter: sample_loader.py:part_filter
```
New `dataset.yaml` (or `crude.yaml` if you like):
```yaml
__module__: megatron.energon
__class__: CrudeWebdataset
subflavors:
  example_crude_selector: foobar
```
As shown above, [subflavors](sect-subflavors) can be assigned directly in the dataset yaml, but we recommend defining them in the [Metadataset](../basic/metadataset) instead.
Subflavors can be used to differentiate the data origin when you mix different crude datasets or crude with non-crude data using a [Metadataset](../basic/metadataset).
You can then use the subflavors to determine how each sample shall be processed, as explained below.
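For instance, a metadataset that mixes two crude datasets and tags each with its own subflavor could look roughly like the sketch below (the paths and the second subflavor value are made up for illustration):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        path: ./my_crude_ds_a
        subflavors:
          example_crude_selector: foobar
      - weight: 1
        path: ./my_crude_ds_b
        subflavors:
          example_crude_selector: bazqux
```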
Let's see how we set up our [Task Encoder](../basic/task_encoder) to cook the crude samples and turn them into real samples like {py:class}`TextSample <megatron.energon.TextSample>`.
A crude sample is essentially a dictionary ({py:class}`CrudeSample <megatron.energon.CrudeSample>` inherits from `dict`) and the loader will look for a {py:class}`Cooker <megatron.energon.Cooker>` that can handle and convert your crude samples.
You can provide cookers as follows:
```python
from megatron.energon import Cooker, basic_sample_keys

# ...

# We recommend placing your cooker functions in a separate file (esp. if they are larger)
def cook_text(sample: dict) -> TextSample:
    return TextSample(
        **basic_sample_keys(sample),
        text=f">{sample['txt'].decode()}<",
    )


class MyTaskEncoder(DefaultTaskEncoder[TextSample, TextSample, TextRawBatch, TextBatch]):
    cookers = [
        Cooker(cook_text, has_subflavors={"example_crude_selector": "foobar"}),
        Cooker(...),  # other cookers for other crude data if needed
    ]
    # ...
```
In the example above, the cooker acts on all crude samples that have a subflavor `example_crude_selector` set to `foobar`.
If you leave out the `has_subflavors` argument, the cooker will apply to any sample.
The cooker will convert the dictionary to a {py:class}`TextSample <megatron.energon.TextSample>` by decoding the raw bytes and decorating the text with some nice angle brackets.
You probably noticed the {py:meth}`basic_sample_keys <megatron.energon.task_encoder.cooking.basic_sample_keys>` helper that we inserted.
All it does is forward the key, restore key, and subflavors from the dict to the real sample. You will always need to forward these, or your dataset will not be restorable.
In a real use case you will want to do a lot more here, and we recommend keeping the cook methods in separate files and importing them where you define your TaskEncoder.
(aux-data)=
## Auxiliary Data for Polylithic Datasets
Using a crude dataset allows you to benefit from two other features of energon:
* Auxiliary Data
* Cache Pools
Both are often used in combination; a typical use case is online packing.
An **auxiliary data source** is an additional data source that supports random access and can be used to load data on-demand using its filename.
It is typically used with polylithic datasets where you have one primary dataset that contains only the text-based sample data
and one or more additional auxiliary data sources that contain the (larger) media data such as images or videos.
An auxiliary data source can be either
* Another energon-prepared WebDataset
* A folder on the local or a remote file system
You can specify it in your [metadataset](../basic/metadataset) yaml as follows (note the `aux:` section):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./my_primary_ds
    aux:
      foo_bar_source: ./aux_ds123
      fs_source: filesystem://./images
      fs_source_abs: filesystem:///absolute/path/to/images
      remote_source: msc://mybucket/path/ds
    subflavors:
      crude_type: my_dual_aux_example
```
The general format is:
```yaml
aux:
  NAME: PATH_OR_URL
  NAME: PATH_OR_URL
  ...
```
You can specify multiple aux sources, each of which can be one of:
* Relative or absolute path to a local prepared energon dataset
* Relative or absolute path to a local folder (use the prefix `filesystem://`)
* Path to a remote prepared energon dataset (use prefix `msc://`)
* *[Planned future feature]*: Path to a remote folder (use prefix `filesystem+msc://`)
In your code, the cooker will automatically receive a {py:class}`FileStore <megatron.energon.FileStore>` reference to the data source as a keyword argument:
```python
from megatron.energon import FileStore

# ...

def cook_text(sample: dict, foo_bar_source: FileStore) -> TextSample:
    additional_text = foo_bar_source.get(sample['add_txt_fname'])
    return TextSample(
        **basic_sample_keys(sample),
        text=f"{sample['txt'].decode()} + {additional_text.decode()}",
    )

# ...
```
You can use multiple sources. You'll have to specify a cooker argument for each source that was defined in the metadataset.
For easier debugging, you should always keep track of all the sources you used. The `get` method takes care of this if you pass it the sample like this:
```python
additional_text = foo_bar_source.get(sample['add_txt_fname'], sample)
```
This will update the sample-internal `__sources__` list with the aux dataset you used.
You can even use your primary dataset as an auxiliary data source and look up files by name. To do that, specify it in the cooker decorator and add an argument:
```python
from megatron.energon import cooker, FileStore

# ...

@cooker(need_primary=True)
def cook_text(sample: dict, primary: FileStore, foo_bar_source: FileStore) -> TextSample:
    # ...
```
You can then retrieve files by their names from the primary dataset.
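For illustration, a minimal sketch of such a cooker follows; the `sibling_txt_fname` key is made up, while `add_txt_fname` is reused from the example above:
```python
from megatron.energon import FileStore, TextSample, basic_sample_keys, cooker

@cooker(need_primary=True)
def cook_text_with_primary(sample: dict, primary: FileStore, foo_bar_source: FileStore) -> TextSample:
    # Load one file from the aux source and another from the primary dataset itself,
    # tracking both in the sample's __sources__ list by passing the sample
    aux_text = foo_bar_source.get(sample['add_txt_fname'], sample)
    sibling_text = primary.get(sample['sibling_txt_fname'], sample)
    return TextSample(
        **basic_sample_keys(sample),
        text=f"{aux_text.decode()} | {sibling_text.decode()}",
    )
```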
(cache-pools)=
## Cache Pools
Cache pools allow the user to defer the data transfer if the content will be needed at some point in the future but not immediately.
This is only needed if the media data is rather large and does not reside on a local disk, but on a network file system (e.g. Lustre) or a remote file system (e.g. object storage).
Cache pools are especially beneficial if you are using buffers in your pipeline, like a shuffle buffer or a packing buffer. For example, when using [online packing](../advanced/packing), we may need to keep a buffer of several thousand samples to optimize for the best packing.
We cannot keep several thousand images in memory, but we also don't need the actual image content to optimize the packing.
Hence we will use auxiliary datasets as explained above.
However, at the time of filling the buffer, **we already know** that we **will need the image content in the future**, so cache pools can be used to **prefetch it in the background**.
Initially we want to load some information about the sample and its image but not the actual image pixels.
Later, when the packing is computed, we need to retrieve the pixel values.
In practice, this means the cooker will use a cache pool to queue the data retrieval from an auxiliary data source and obtain a lazy object (a handle to this future data). In a later stage (like {py:meth}`pack_selected_samples <megatron.energon.TaskEncoder.pack_selected_samples>`), the lazy object can be used to retrieve the content.
Ideally, in the meantime, the cache pool has already downloaded the data to a local SSD.
### Using a Cache Pool
When calling {py:func}`get_savable_loader <megatron.energon.get_savable_loader>`,
we pass a cache pool as an additional argument:
```python
from megatron.energon import FileStoreCachePool

# ...

loader = get_savable_loader(
    my_ds,
    ...,
    cache_pool=FileStoreCachePool(
        parent_cache_dir="/local_scratch/cache",
        num_workers=1,
    ),
)
```
Then we tell the cooker decorator that we need access to the cache pool and use it to get a lazy object:
```python
import torch

from megatron.energon import CachePool, FileStore, Lazy, Sample, basic_sample_keys, cooker, edataclass
from megatron.energon.av import AVDecoder

# ...

@edataclass
class TextVideoSample(Sample):
    text: str
    video: Lazy[AVDecoder]


@edataclass
class PackedTextVideoSample(Sample):
    text: str
    video: torch.Tensor


@cooker(need_cache=True)
def cook_video(sample: dict, video_source: FileStore, cache: CachePool) -> TextVideoSample:
    # Previous non-cached version:
    # video = video_source.get(sample['video_path'])
    # Cached version: only queue the retrieval and obtain a lazy handle
    video = cache.get_lazy(video_source, sample['video_path'])
    return TextVideoSample(
        **basic_sample_keys(sample),
        text=sample['txt'].decode(),
        video=video,  # Pass the lazy object on
    )
```
Later down the data processing pipeline, we can retrieve the data, for example here:
```python
@stateless
def pack_selected_samples(self, samples: List[TextVideoSample]) -> PackedTextVideoSample:
    # Get the real object now:
    video_data: AVDecoder = samples[0].video.get(samples[0])
    return PackedTextVideoSample.derive_from(
        samples[0],
        text=samples[0].text,
        video=video_data.get_video_clips([(0, 1), (19, 20)])[0],
    )
```
There is a second option, e.g. if you want to combine a monolithic dataset with packing and caching: Use `cache.to_cache()` to move already loaded data to the cache:
```python
@cooker(need_cache=True)
def cook_video_monolithic(sample: dict, cache: CachePool) -> TextVideoSample:
    # Previous non-cached version:
    # video: AVDecoder = sample['mp4']
    # Move the video to the cache, retrieve it later when it is needed again.
    video: Lazy[AVDecoder] = cache.to_cache(
        sample['mp4'],
        sample['__key__'] + ".mp4",  # Just a name for debugging
    )
    return TextVideoSample(
        **basic_sample_keys(sample),
        text=sample['txt'].decode(),
        video=video,  # Pass the lazy object on
    )
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Customized Blending
In your Task Encoder, you can customize how datasets are blended by overriding the `build_train_datasets` method as shown below.
```{warning}
This interface is not stable and may change frequently as we add new features. If you change
how the datasets are plugged together, be prepared to adapt your code to future changes.
```
```py
class CaptioningTaskEncoder(
    DefaultTaskEncoder[CaptioningSample, CaptioningSample, CaptioningRawBatch, CaptioningBatch]
):
    ...

    def build_train_datasets(
        self,
        *,
        datasets: List[Tuple[BaseCoreDatasetFactory[T_sample], float]],
        worker_config: WorkerConfig,
        batch_size: Optional[int],
        batch_drop_last: bool = False,
        packing_buffer_size: Optional[int] = None,
        virtual_epoch_length: int = 0,
        shuffle_buffer_size: Optional[int] = None,
    ) -> SavableDataset[T_batch]:
        # The default implementation uses BlendDataset, which mixes the datasets according to their weights.
        # This could be customized, e.g. to batch the datasets first (i.e. each batch only contains data
        # from a single dataset) and then blend, which would yield the same distribution.
        dataset = BlendDataset(
            *datasets,
            worker_config=worker_config,
        )

        # Build batches from blended samples
        dataset = self.build_batch(
            dataset,
            batch_size=batch_size,
            batch_drop_last=batch_drop_last,
            worker_config=worker_config,
        )

        # Optionally epochize
        if virtual_epoch_length > 0:
            dataset = EpochizeDataset(
                dataset,
                length=virtual_epoch_length,
                worker_config=worker_config,
            )

        return dataset
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
(custom-sample-loader)=
# Custom Sample Loader
```{warning}
The custom sample loader is a legacy feature and using [crude datasets](../advanced/crude_datasets.md) with cookers is usually the preferred way.
This feature might be deprecated at some point in the future.
```
Instead of using a `field_map` in your `dataset.yaml`, you can create custom python code for sample loading
right next to your dataset inside the `.nv-meta` folder.
One reason for the planned deprecation is that you cannot easily version-control the code inside this folder.
In contrast, cookers live inside your code repository together with the task encoder.
Here's an example for your updated `dataset.yaml` if you want to use a sample loader:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: OCRSample
sample_loader: sample_loader.py:sample_loader
part_filter: sample_loader.py:part_filter
```
In addition, you need to create a python file inside the `.nv-meta` folder. In this case it's
called `sample_loader.py`.
That file needs to contain the two methods referenced above:
```python
import torch


def sample_loader(raw: dict) -> dict:
    data = raw["bbox.json"]
    return dict(
        __key__=raw["__key__"],
        image=raw["jpg"],
        text=raw["gt.txt"],
        lines_boxes=torch.tensor([box["bbox"] for box in data], dtype=torch.int64),
        lines_text=[box["text"] for box in data],
    )


def part_filter(part: str) -> bool:
    return part in ("bbox.json", "gt.txt", "jpg")
```
Your `sample_loader` method must accept a dictionary as an argument and return a dictionary. It directly operates on
the webdataset samples, and the resulting dictionary keys should map to the corresponding sample class members,
in this case those of an {py:class}`OCRSample <megatron.energon.OCRSample>`.
With the optional `part_filter` method, you can prevent some webdataset fields from being loaded.
Given a field name, the method should return True if the field is to be kept.
(interleaved-sample-loader)=
## Example: Interleaved Data and Arbitrary Image Count
### The Webdataset Structure
If you need multiple files with an arbitrary number of items per sample, e.g. multiple image / video / audio files, this section shows a blueprint for how to set up your webdataset tar files and how to load that webdataset with Energon.
The structure of the shard files could be like this:
`tar -tvf shard_0.tar`:
```
sample_000001.2345ew.jpg
sample_000001.json
sample_000002.35tags.jpg
sample_000002.as23ds.jpg
sample_000002.gd1dtg.jpg
sample_000002.gds233.jpg
sample_000002.json
sample_000002.sdag42.jpg
sample_000003.json
sample_000004.asf234.jpg
sample_000004.json
```
where the structure of a json file is:
`tar -xf shard_0.tar sample_000001.json -O`:
```json
{
  "images": [null, "2345ew.jpg", null],
  "texts": ["This is some text, an image is following.", null, "More text after the image."]
}
```
Note that the image path corresponds to the part of the image filename after the first "." in the sample; this is all part of the extension as defined by webdataset. Everything before the first "." is the sample key and must be identical for all files that belong to the same sample (group).
### Usage with Energon
To make this work with Energon, in the `energon prepare` [CLI preparation tool](energon-prepare), you can either tell the wizard to create a custom sample loader template for you, or change the files accordingly. Here is the example with the structure above:
`.nv-meta/dataset.yaml`:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: InterleavedSample
part_filter: sample_loader.py:part_filter
sample_loader: sample_loader.py:sample_loader
```
`.nv-meta/sample_loader.py`:
```python
import torch


def sample_loader(raw: dict) -> dict:
    # Note that the images are already decoded, as well as the json part.
    return dict(
        __key__=raw["__key__"],
        sequence=[
            (raw[image] if text is None else text)
            for image, text in zip(raw["json"]["images"], raw["json"]["texts"])
        ],
    )


def part_filter(part: str) -> bool:
    # Need to load all parts
    return True
```
When iterating, you'll get these samples as `InterleavedSample`, whose sequence contains image tensors and text strings. The first sample would look like this:
```python
InterleavedSample(
    sequence=[
        "This is some text, an image is following.",
        torch.Tensor(...),  # or a PIL.Image.Image
        "More text after the image.",
    ]
)
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Epochized Blending
As an alternative to blending with a weight for each dataset, this concept makes blending exact and
lets iteration follow epochs (i.e. iteration can be interrupted after each epoch).
Here is an example `metadataset.yaml` config file that changes to epochized blending:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # Blend the following datasets, repeating coco 5 times, coyo-train 2 times and coyo-val once
    blend_epochized:
      - repetitions: 5
        path: ./coco
        # ... Other parameters
      - repetitions: 2
        path: ./coyo
      - repetitions: 1
        path: ./coyo
        split_part: val
```
Now, the call to `get_train_dataset` requires the additional parameter `repeat=False` to interrupt iterating after one epoch:
```py
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'metadataset.yaml',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    repeat=False,
))

# This will now stop iterating after the datasets have been iterated (coco 5 times, coyo-train 2
# times and coyo-val once). Of course, the data is still being shuffled between all those
# datasets.
for batch in loader:
    print(batch)

# This will iterate the second epoch
for batch in loader:
    print(batch)
```
If the metadataset is used with `get_val_dataset`, the `repetitions` are ignored.
The metadataset would also work without setting `repeat=False`, but then the shuffle buffer would shuffle samples across epoch boundaries.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Grouping
Grouping allows for rule-based batching of samples into one batch on the fly.
Note how this is different from [packing](packing) which joins multiple samples into one (and is done before batching).
On the other hand, grouping is an alternative to standard batching.
## Example use cases
* Select samples to batch based on image resolution, so that only samples of the same size are in one batch
* Select blended samples based on their dataset origin, so that one batch does not mix different tasks or data types
## How to group
To use grouping, you need to define the method {py:meth}`batch_group_criterion <megatron.energon.TaskEncoder.batch_group_criterion>` in your custom task encoder.
This method receives a sample and returns a hashable value that is used to group the samples,
along with the batch size for that group.
Samples with the same batch group criterion will be batched together. Once enough samples for one group
have been collected (reached the batch size for that group), they will be batched and pushed down the pipeline
to the next processing step.
Here's an example task encoder that batches samples based on their image aspect ratios:
```python
class GroupingTaskEncoder(DefaultTaskEncoder):
    def batch_group_criterion(self, sample: CaptioningSample) -> Tuple[Hashable, Optional[int]]:
        aspect_ratio = sample.image.shape[2] / sample.image.shape[1]
        # Bin aspect ratios into 3 groups
        if aspect_ratio < 0.8:
            return "portrait", 8
        elif aspect_ratio < 1.2:
            return "square", 8
        else:
            return "landscape", 8
```
In the example, the aspect ratio is sorted into one of three bins and a string is used as the grouping key.
The batch size used here is always 8.
Here is another example where each batch contains only images with the exact same size.
Note how the image shape itself is used as the grouping key.
```python
class GroupingTaskEncoder(DefaultTaskEncoder):
    def batch_group_criterion(self, sample: CaptioningSample) -> Tuple[Hashable, Optional[int]]:
        batch_size = 4 if sample.image.shape[1] < 512 else 2
        return sample.image.shape, batch_size
```
For images with a height of less than 512 pixels, the batch size will be 4, for larger images it's reduced to 2.
## Fixed global batch size
Instead of specifying the batch size for each group individually, you can also specify the batch size as usual when calling
`get_train_dataset`. The `batch_group_criterion` method should then return `None` for the batch size.
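A minimal sketch of this variant, reusing the `CaptioningSample` from above (the grouping rule itself is arbitrary):
```python
from typing import Hashable, Optional, Tuple

from megatron.energon import CaptioningSample, DefaultTaskEncoder


class GroupingTaskEncoder(DefaultTaskEncoder):
    def batch_group_criterion(self, sample: CaptioningSample) -> Tuple[Hashable, Optional[int]]:
        # Still group by orientation, but return None so that the batch_size passed to
        # get_train_dataset decides how many samples make up a batch
        aspect_ratio = sample.image.shape[2] / sample.image.shape[1]
        return ("landscape" if aspect_ratio >= 1.0 else "portrait"), None
```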
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Joining Datasets
Joining is needed when you have two or more corresponding datasets, each of which contributes some parts of every sample.
```{image} ../images/joining.png
:alt: Joining Datasets
:class: bg-primary
:width: 550px
:align: center
```
Dataset joining is analogous to webdataset's "add column", i.e., merging two datasets whose tar files contain corresponding entries for the same samples.
## Example File Structure
Here is an example structure of the datasets within the tar files:
```
ds1
├── .nv-meta
│ ├── .info.json
│ ├── split.yaml
│ └── dataset.yaml
├── shard1.tar
│ ├── 0001.jpg
│ ├── 0002.jpg
│ └── 0003.jpg
├── shard1.idx
ds2
├── .nv-meta
│ ├── .info.json
│ ├── split.yaml
│ └── dataset.yaml
├── shard1.tar
│ ├── 0001.txt
│ ├── 0002.txt
│ └── 0003.txt
├── shard1.idx
```
In this example, `ds1/shard1.tar` and `ds2/shard1.tar` contain files with the same base names / keys.
When joining, the files from both datasets are joined based on their keys, which must be unique within each dataset across all the shards. Each dataset must be [prepared](../basic/data_prep) (i.e. .nv-meta created) with a recent version of energon.
## Minimal Example
Below is a minimal meta dataset example for joining.
`metadataset_minimal.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    join:
      - path: ds1
      - path: ds2
    joiner:
      __module__: megatron.energon
      __class__: CaptioningSample
```
Note that the blending is omitted here, as it is optional. Joining is of course also optional.
```{warning}
All metadatasets that contain a `join` must be prepared with the `energon prepare` command.
This will compute the join index and store it next to the metadataset in a folder with a similar name.
```
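For example, for the minimal metadataset above this would be (assuming the file lives in the current directory):
```shell
energon prepare ./metadataset_minimal.yaml
```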
## Join Modes
When joining two datasets, it may happen that the first dataset (primary dataset) has more samples or fewer samples than the secondary dataset(s).
In this case, we have to decide how to handle these samples that do not match.
The primary dataset always serves as the reference and there will never be more samples in the join result than in the primary dataset. However if a primary sample has no match in a secondary dataset, it may be skipped as explained below.
For each of the secondary datasets, the user can specify a `nonmatch` setting.
It determines what happens if a sample from the primary dataset is not found in the given secondary dataset:
* `error` (default): An error is raised
* `skip`: The whole sample is skipped
* `none`: The column for the current secondary dataset is filled with `None` if there's no match
Example `metadataset_nomatch.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    join:
      - path: ds1
      - path: ds2
        nonmatch: skip
      - path: ds3
        nonmatch: none
    joiner:
      __module__: megatron.energon
      __class__: CaptioningSample
```
To illustrate the effect, let's look at some example data:
* `ds1` samples: `s1`, `s2`, `s3`, `s5`, `s6`
* `ds2` samples: `s1`, `s3`, `s4`, `s6`, `s7`
* `ds3` samples: `s1`, `s2`, `s3`, `s100`
The resulting joined data would contain the following samples, one in each row:
| ds1 | ds2 | ds3 |
| --- | --- | ---- |
| s1 | s1 | s1 |
| s3 | s3 | s3 |
| s6 | s6 | None |
Explanation:
* The sample key `s1` is available in all datasets.
* `s2` is missing from `ds2` and nonmatch is set to `skip`, so the sample will not appear in the result.
* `s3` is available in all datasets.
* `s4` is not in the primary dataset. Only samples from the primary dataset will be included.
* `s5` is again missing from `ds2`, so it is skipped; this time it is also missing from `ds3`.
* `s6` is missing from `ds3` and `ds3` has `nonmatch` set to `none`, so the sample is not skipped, but the column for `ds3` is set to `None`
## Extensive Example
Here is a more extensive example that shows multiple things at once:
* Joining can be used inside blending
* The datasets to be joined can have custom subflavors or dataset yamls specified
* A custom "joiner" can be specified to define how samples are joined and what the resulting type is
* The `nonmatch` setting is not included here, but would work just like shown above
`metadataset_extended.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        join:
          - path: ds1
            dataset_config: dataset.yaml  # If override is needed
          - path: ds2
            dataset_config: dataset.yaml
            subflavors:  # If needed, will be merged (overriding) with parent subflavors
              ds2_extra: 2
            split_config: split.yaml
        joiner:
          __module__: my_module
          __class__: JoinedSample  # Type should implement from_joined(ds1, ds2)
        split_config: split.yaml  # Sets this for all joined datasets
        split_part: train  # Sets this for all joined datasets
        subflavors:  # Sets this for all joined datasets (it will be merged with their individual subflavors)
          source: metadataset.yaml
          src: ds1
```
## Custom Join Type
To define a custom join type, you can create a Python class as shown below in `my_module.py`:
```python
from dataclasses import dataclass

import torch

from megatron.energon import Sample, TextSample


@dataclass
class JoinedSample(Sample):
    text1: torch.Tensor
    text2: torch.Tensor

    @staticmethod
    def from_joined(ds1: TextSample, ds2: TextSample) -> "JoinedSample":
        return JoinedSample.derive_from(
            ds1,
            text1=ds1.text,
            text2=ds2.text,
        )
```
This class should implement the `from_joined` method to combine samples from `ds1` and `ds2`.
Note: It is important to use `derive_from` with the first argument being the first sample, as this will guarantee that the state can be saved and restored. It ensures that all the internal keys of the sample are retained.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Packing
Packing (sometimes also called sequence packing) enables you to selectively combine multiple
input samples into a single sample, for example depending on their length.
This technique is commonly used with large language models when the input samples have very different
lengths, which would otherwise lead to lots of padding and hence wasted compute.
This section explains how you can pack samples together and utilize the full context length.
## How to pack samples on the fly
To use packing, you need to implement the TaskEncoder methods {py:meth}`select_samples_to_pack <megatron.energon.TaskEncoder.select_samples_to_pack>`
and {py:meth}`pack_selected_samples <megatron.energon.TaskEncoder.pack_selected_samples>`.
Furthermore, you need to initialize the loader with the `packing_buffer_size` argument set to a non-zero number.
The `select_samples_to_pack` method will receive a list of samples (size according to the selected `packing_buffer_size`),
and should partition those samples into groups that shall be packed together. Hence the function returns
a list of lists of samples.
For each group, the second method `pack_selected_samples` will be called. You need to implement how a group of
samples will be mapped to a single sample. In terms of LLMs for example, this method might concatenate the input tokens.
```{admonition} Note
:class: important
You can set the `__restore_key__` of the packed sample to an empty tuple, since energon will set the correct
restore key afterwards, based on the samples that went in.
```
```{warning}
To handle attention masks and tokenized inputs, you will want to operate on a different sample type.
The `pack_selected_samples` method may return a different sample type that is expected as the input for the `batch` method.
```
It is important to mark custom functions like `encode_sample` and `pack_selected_samples` as `@stateless` to allow saving
samples for packing. If augmentations happen, the function should be marked with
`@stateless(restore_seeds=True)` to deterministically set the seeds based on the `TaskEncoder.current_sample_index`.
You have to make sure the methods are actually stateless, meaning that they produce the same output when invoked
with the same input and random states.
Example packing for a large language model extending the example from the [](../basic/task_encoder) section:
```python
class PackingCaptioningTaskEncoder(CaptioningTaskEncoder):
    """This class extends the CaptioningTaskEncoder and adds select_samples_to_pack and
    pack_selected_samples for packing samples efficiently on-the-fly.

    Set the `packing_buffer_size` of get_(train|val)_dataset to an accordingly large number to get a
    properly sized input sample buffer with good diversity.
    """

    @stateless(restore_seeds=True)
    def encode_sample(self, ...):
        # Added `stateless` decorator to allow saving samples for packing. Will set the seed
        # deterministically based on the self.current_sample_index.
        ...

    def select_samples_to_pack(self, samples: List[CaptioningSample]) -> List[List[CaptioningSample]]:
        # Do something intelligent here, e.g. sort by caption length and concat where possible.
        # This could be better, but it's just an example.
        samples.sort(key=lambda x: len(x.caption))
        groups = []
        while len(samples) > 0:
            # Always take at least one sample per group to avoid an endless loop for overly long captions
            batch = [samples.pop(0)]
            caption_len = len(batch[0].caption)
            while len(samples) > 0 and caption_len + len(samples[0].caption) < self.max_length:
                sample = samples.pop(0)
                batch.append(sample)
                caption_len += len(sample.caption)
            groups.append(batch)
        return groups

    @stateless
    def pack_selected_samples(self, samples: List[CaptioningSample]) -> CaptioningSample:
        # Construct a new CaptioningSample by concatenating the captions
        ...
```
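To actually enable packing, pass a non-zero `packing_buffer_size` when creating the loader. A rough sketch along these lines, assuming the task encoder is passed via the `task_encoder` argument and using placeholder values for the dataset path and buffer size:
```python
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

loader = get_loader(get_train_dataset(
    'metadataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    packing_buffer_size=1000,
    task_encoder=PackingCaptioningTaskEncoder(),
    worker_config=WorkerConfig.default_worker_config(),
))
```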
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Parallelism
Neural network parallelism can be categorized into several types:
1. **Data Parallelism** (DP): This involves splitting the data across multiple processors and performing the same operation on each subset of the data. It is commonly used to increase the global batch size.
2. **Model Parallelism**: In this approach, different parts of the model are distributed across multiple processors. This is useful when the model itself is too large to fit into the memory of a single processor.
3. **Pipeline Parallelism** (PP): This technique involves breaking down the model into different stages and processing different mini-batches of data through these stages in a pipeline fashion. It helps in improving the utilization of resources and reducing idle time.
4. **Tensor Parallelism** (TP): This method splits individual tensors (weights and activations) across multiple devices. It is particularly effective for very large models where even a single layer cannot fit into the memory of one device.
These parallelisms have different consequences for the dataloader:
- **Data Parallelism** (DP): The dataloader needs to ensure that each processor gets a different subset of the data. This is supported by Energon. The data parallel groups should be specified in the worker config.
- **Pipeline Parallelism** (PP): Data is typically only loaded on the first pipeline parallel rank and propagates through the other ranks within the pipeline parallel group. This means you only instantiate an Energon dataset and loader on the first ranks of those groups.
- **Tensor Parallelism** (TP): The dataloader loads the same input data on multiple devices. Typically, this is ensured either by instantiating the dataloader in exactly the same way on the same data parallel rank across the different data parallel groups, or, e.g., by loading the data only once and distributing it using torch distributed.
## Example
Example with the following ranks and worker configuration (Data Parallel = 2, Pipeline Parallel = 2, Tensor Parallel = 2):
* `Global Rank 0`: `DP Rank = 0` (DP group A), `PP Rank = 0`, `TP Rank = 0`
* `Global Rank 1`: `DP Rank = 0` (DP group B), `PP Rank = 0`, `TP Rank = 1`
* `Global Rank 2`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 0`
* `Global Rank 3`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 1`
* `Global Rank 4`: `DP Rank = 1` (DP group A), `PP Rank = 0`, `TP Rank = 0`
* `Global Rank 5`: `DP Rank = 1` (DP group B), `PP Rank = 0`, `TP Rank = 1`
* `Global Rank 6`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 0`
* `Global Rank 7`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 1`
When saving the state of the data loader, we only need to store the states
of global ranks 0 and 4, i.e. the first DP group "A".
Ranks 1 and 5 will have the same state as they are duplicates.
When restoring the state, global ranks 0, 1, 4, 5 need to receive a state.
There are different ways to achieve this. The following example illustrates how the state
can be saved and restored in a distributed setting.
```py
import torch

from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig

# Initialize the process group
torch.distributed.init_process_group(backend='nccl')

# Get the DP, PP, TP ranks
global_rank = torch.distributed.get_rank()
data_parallel_rank = [0, 0, None, None, 1, 1, None, None][global_rank]
pipeline_parallel_rank = [0, 0, 1, 1, 0, 0, 1, 1][global_rank]
tensor_parallel_rank = [0, 1, 0, 1, 0, 1, 0, 1][global_rank]

if global_rank in (0, 4):
    # DP Group A
    # If on rank 0 or 4, the DP group consists of those ranks (each representing DP ranks 0 and 1).
    data_parallel_group = torch.distributed.new_group(ranks=[0, 4])
elif global_rank in (1, 5):
    # DP Group B
    # If on rank 1 or 5, the DP group consists of those ranks (each representing DP ranks 0 and 1).
    data_parallel_group = torch.distributed.new_group(ranks=[1, 5])
else:
    data_parallel_group = None

if data_parallel_rank is not None:
    assert pipeline_parallel_rank == 0, "Only Pipeline Parallel ranks 0 load data"

    # Set the worker config correspondingly
    worker_config = WorkerConfig(
        rank=data_parallel_rank,
        world_size=torch.distributed.get_world_size(data_parallel_group),
        num_workers=3,
        data_parallel_group=data_parallel_group,
    )

    # Create the loader with that config
    loader = get_savable_loader(get_train_dataset(
        'coyo-coco-dataset.yaml',
        batch_size=4,
        shuffle_buffer_size=100,
        max_samples_per_sequence=100,
        worker_config=worker_config,
    ))

    # Iterate the data
    for i, batch in zip(range(10), loader):
        # Do forward-backward pass
        print(batch)
        break

    if tensor_parallel_rank == 0:
        # Save the state only for the first TP rank (the other TP ranks have a copy of that state)
        state = loader.save_state_rank()
        # E.g. save to disk with torch
        torch.save(state, f"dataloader_rank{data_parallel_rank}.pt")

        # Alternatively, save once for the whole dp group:
        # state = loader.save_state_global(global_dst_rank=0)
        # if state is not None:
        #     torch.save(state, "dataloader.pt")

# ... when loading:

if data_parallel_rank is not None:
    assert pipeline_parallel_rank == 0, "Only Pipeline Parallel ranks 0 load data"

    # Restore the state for a new loader
    loader = get_savable_loader(get_train_dataset(
        'coyo-coco-dataset.yaml',
        batch_size=4,
        shuffle_buffer_size=100,
        max_samples_per_sequence=100,
        worker_config=worker_config,
    ))

    # E.g. load from disk as saved above
    state = torch.load(f"dataloader_rank{data_parallel_rank}.pt")
    # Restore the state
    loader.restore_state_rank(state)

    # Alternatively, when using a global checkpoint,
    # load the checkpoint from disk on every dp rank:
    # state = torch.load("dataloader.pt")
    # loader.restore_state_global(state)

    # Or load only once from disk for each dp group:
    # if data_parallel_rank == 0:
    #     state = torch.load("dataloader.pt")
    # else:
    #     state = None
    # loader.restore_state_global(state, src_rank=0)
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Remote Dataset
Megatron Energon supports the use of remote datasets. For versions newer than 5.2.0, Energon file access is based on [Multi Storage Client (MSC)](https://github.com/NVIDIA/multi-storage-client).
This means you can train or validate with your data right from any storage by simply swapping the dataset path for a so-called _MSC URL_.
## Prerequisites
To use a remote dataset, install energon with one or more of these extras:
* `s3`
* `aistore`
* `azure-blob-storage`
* `google-cloud-storage`
* `oci`
like this:
```sh
pip install megatron-energon[s3,oci]
```
Set up the msc config as described in [Multi Storage Client documentation](https://nvidia.github.io/multi-storage-client/).
You can also use the rclone config with msc, as was described prior to 5.2.0.
For fast data loading, we recommend activating MSC local caching:
```yaml
cache:
  size: 500G
  use_etag: true
  eviction_policy:
    policy: "fifo"
    refresh_interval: 3600
  cache_backend:
    cache_path: /tmp/msc_cache # prefer to use local NVMe, but a Lustre path also works
```
And point MSC to the config with
```sh
export MSC_CONFIG=/path/to/msc_config.yaml
```
## The URL syntax
The syntax is as simple as
```
msc://CONFIG_NAME/PATH
```
For example:
```
msc://coolstore/mainbucket/datasets/somedata
```
You can use this URL instead of paths to datasets in
* Functions like `get_train_dataset`, `get_val_dataset`
* Inside [metadataset](../basic/metadataset) specifications
* As arguments to `energon prepare` or `energon lint`. Note that those may be slow for remote locations.
* Or as a path to [`energon mount`](energon-mount) to locally inspect your remote dataset 😎
Example usage:
```python
ds = get_train_dataset(
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Reproducible Scaling
A special use case is to re-run or continue a training run with the exact same data order, but using a different number of nodes or ranks.
Since version 2.0.0, Megatron Energon supports this behavior if a few constraints are met:
* The energon major version must be the same across runs
* The global batch size must stay the same across runs
* The global batch size must be a multiple of `micro-batch size * world_size * num_workers`
  * The multiple corresponds to the number of gradient accumulation steps in your training
* The product `world_size * num_workers` must stay the same across runs, such that the global number of workers stays the same
* When using random seed offsets in your {py:class}`WorkerConfig <megatron.energon.WorkerConfig>`, those need to be the same
By obeying these rules, you will be able to reproduce the same global batches. Let's look at an example.
| Name | Global batch size | Micro batch size | World size | Number of Workers | Gradient accumulation steps |
| ----- | ----------------- | ---------------- | ---------- | ----------------- | --------------------------- |
| Run 1 | 8 | 2 | 4 | 1 | 1 |
| Run 2 | 8 | 2 | 1 | 4 | 4 |
Iterating the dataset will yield the same global batches for both of these runs, if the seed is set correctly.
In practice, you will need to adapt your worker config accordingly.
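As a rough sketch of the two runs from the table (only `world_size` and `num_workers` matter for the rule above; rank 0 is shown, every rank passes its own rank index):
```python
from megatron.energon import WorkerConfig

# Run 1: 4 ranks with 1 worker each -> 4 global workers, micro batch size 2,
# 1 gradient accumulation step -> global batch size 8
worker_config_run1 = WorkerConfig(rank=0, world_size=4, num_workers=1)

# Run 2: a single rank with 4 workers -> still 4 global workers, micro batch size 2,
# 4 gradient accumulation steps -> global batch size 8
worker_config_run2 = WorkerConfig(rank=0, world_size=1, num_workers=4)
```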
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Dataset Subsets
Dataset subsets allow restricting a dataset (or parts of a metadataset hierarchy) to a specific portion of the available samples.
This is useful for rapid prototyping, ablation studies, different training stages, or constructing disjoint train/validation/test splits that differ from the original dataset configuration.
A subset is defined by a two-element `range` list consisting of `[start, end]` (where `start` is inclusive, `end` exclusive).
Each element can be either
* a **percentage** string (e.g. `"0%"`, `"12.5%"`, `"100%"`) – interpreted relative to each inner
dataset size, or
* an **absolute** integer – interpreted as a sample index. Absolute indices are only allowed for
*leaf* datasets (`path` to a prepared dataset containing `.nv-meta`).
## Basic example
The snippet below keeps the first 80 % of the *COYO* `train` split (as defined in the `split.yaml`) for training while
evaluating on the remaining 20 % of the `train` split. Note how the `subset` key is placed directly next to the corresponding `path`.
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [0%, 80%]}
  val:
    path: ./coyo
    split: train
    subset: {range: [80%, 100%]}
```
## Nested subsets and merging rules
Subsets can appear at any level that ultimately yields samples
(direct `path` reference to a prepared dataset containing `.nv-meta`, `join`, `blend`, `blend_epochized`).
When multiple subsets are nested, the *inner* subset is applied first, then the portion selected by the *outer* subset is applied *within* the already selected range.
For percentages the ranges are composed multiplicatively.
Example: the outer subset `[0%, 50%]` followed by an inner subset `[25%, 75%]` results in the final
range `[25%, 50%]` of the original dataset.
Absolute indices short-circuit merging: they can **only** be specified at the leaf level and must
not be combined with another absolute range farther up the hierarchy.
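The percentage composition example above would look roughly like this in YAML (the path is illustrative):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # Outer subset: first 50% of whatever the inner level selects
    subset: {range: [0%, 50%]}
    blend:
      - weight: 1.0
        path: ./coyo
        # Inner subset: 25% to 75% of the dataset
        # -> effectively [25%, 50%] of the original dataset
        subset: {range: [25%, 75%]}
```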
## Absolute ranges
Absolute indices are handy when exact sample counts are required.
## Advanced examples
The following configuration combines absolute ranges with the nesting rules. The inner subset takes
the first **1000** samples of the *COCO* train split and mixes them with the full *COYO* train split using
weight-based blending. The outer subset then reduces the inner ranges to their first 50%, thus effectively taking
the first **500** samples of *COCO*, mixed with the first **50%** of the *COYO* dataset.
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # The subset is applied to each blended dataset separately.
    # I.e. for the first, the sample range is [0, 500], for the second the range is [0%, 50%]
    subset: {range: [0%, 50%]}
    blend:
      - weight: 1.0
        path: ./coco
        subset:
          # Take exactly 1000 samples (indices 0-999)
          range: [0, 1000]
      - weight: 1.0
        path: ./coyo
```
Absolute ranges can also be specified to run up to the end of the dataset using the `end` keyword:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [1422, end]}
```
## Python usage
No API changes are required on the Python side – subsets are fully specified in the YAML. Simply
load the dataset with the regular helpers.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Command-Line Interface
After you have [installed](../intro/installation) energon, a script called `energon` will be available on your PATH.
It provides commands to prepare, preview, or lint datasets on disk.
Here's a simple example:
```shell
energon prepare /mnt/data/my_captioning_webdataset
```
The above command will scan your existing off-the-shelf [web dataset](https://webdataset.github.io/webdataset/)
and add the [needed metadata](data-on-disk) to make it compatible with Energon.
Below, you can see the available sub-commands under `energon`.
```{eval-rst}
.. click:: megatron.energon.cli.main:main
:prog: energon
:nested: short
```
(energon-prepare)=
## energon prepare
An interactive tool to generate metadata for your existing webdataset (or [jsonl dataset](create-jsonl-dataset)).
This will help make the dataset compliant with our [format](data-on-disk).
The tool will ask you for a train/val/test split and how to assign the webdataset fields to the
fields of the corresponding sample type in Energon.
See [Data Preparation](../basic/data_prep) for more details on how to use this command.
## energon info
Prints information about the dataset, such as the overall number of samples and its size.
It also prints the energon version that was used to prepare the dataset, if a recent version was used.
## energon lint
You can execute this tool on the prepared dataset to check if the data is valid and loadable.
It will report any problems such as non-readable images.
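For example, reusing the dataset path from the beginning of this page:
```shell
energon lint /mnt/data/my_captioning_webdataset
```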
(energon-mount)=
## energon mount
Use this to mount your [prepared dataset](../basic/data_prep) as a virtual read-only filesystem and inspect it using `ls` or other file browsing tools.
It is as simple as running
```shell
energon mount /PATH/TO/DATASET ./MY_MOUNT_POINT
```
This will leave the process in the foreground and the mount will exist as long as the program is running.
If you want to detach the process to the background, use the `-d` or `--detach` flag.
Two modes are supported by `energon mount`:
| | Flat mode (default) | Sample folder mode (flag `-s`) |
| --- | --- | --- |
| Description | All files from all shards listed at<br/>the root of the mount point. | One folder per sample key,<br/>each folder containing files<br/>named by the sample part extension |
| Example | `001.jpg`<br/>`001.txt`<br/>`002.jpg`<br/>`002.txt`<br/>`...` | `001/`<br/>`┣ jpg`<br/>`┗ txt`<br/>`002/`<br/>`┣ jpg`<br/>`┗ txt`<br/>`...` |
```{warning}
You should not use the same sample keys in multiple shards of the same dataset.
If you do, `energon mount` will not work as intended and it will display WARNING files in the virtual mount.
```
## energon preview
This command will load a dataset and display samples one-by-one on the console.
Note that this will not work for datasets with non-standard flavors or crude datasets.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Packages and Modules
```{toctree}
---
maxdepth: 2
---
modules_data
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# megatron.energon
```{eval-rst}
.. automodule:: megatron.energon
:members:
:undoc-members:
:show-inheritance:
.. automodule:: megatron.energon.task_encoder.cooking
:members:
:undoc-members:
:show-inheritance:
```
# megatron.energon.av
```{eval-rst}
.. automodule:: megatron.energon.av
:members:
:undoc-members:
:show-inheritance:
```