Initial commit

Commit f356f546 authored Feb 04, 2026 by maming
20 changed files
--- a/Megatron-Energon/docs/source/basic/basics_flow.md
+++ b/Megatron-Energon/docs/source/basic/basics_flow.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Data Flow
+
+![energon data flow](../images/data_flow.png)
+
+How the data flows through those task encoder methods is explained step by step in [](task_encoder).
+
+(flavors_general)=
+## Dataset Flavors
+
+The datasets are organized in "flavors", i.e. each modality returned by the dataset is a "flavor".
+A modality can for example be a {py:class}`CaptioningSample <megatron.energon.CaptioningSample>` or a
+{py:class}`VQASample <megatron.energon.VQASample>`. The dataset class combines the source data format
+and the iterated sample format. For example, the {py:class}`CaptioningWebdataset <megatron.energon.CaptioningWebdataset>` 
+combines the webdataset loader with the {py:class}`CaptioningSample <megatron.energon.CaptioningSample>`.
+
+For all types, see [](sect-sample-types).
--- a/Megatron-Energon/docs/source/basic/data_decoding.md
+++ b/Megatron-Energon/docs/source/basic/data_decoding.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Data Decoding
+
+When iterating your dataset, the first processing step that each sample goes through, even before [](sample-loading), is the decode step.
+
+As explained [here](wds-format), each sample in the underlying `.tar` file can have multiple entries with different extensions.
+Just like WebDataset itself, energon uses the [auto decode](https://rom1504.github.io/webdataset/api/webdataset/autodecode.html) functionality of WebDataset to transform those raw bytes from the tar file entries
+to usable objects. Images for example shall be decoded to arrays or tensors of pixel data.
+
+In Energon, the class responsible for data decoding is {py:class}`DefaultDecoderWebdatasetFactory <megatron.energon.DefaultDecoderWebdatasetFactory>` (which usually you don't need to touch directly). In its `__init__` method, it contains a piece of code that initializes the auto decoder and tells it which data types shall be decoded and how.
+
+Decoders typically convert
+
+* Text to strings (instead of bytes)
+* JSON to dictionaries and lists
+* Images, videos and audio to pixel tensors or audio sample tensors
+    * For other options see below
+* Torch PTH files to tensors
+* NPY files to numpy arrays
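
The conversions listed above can be pictured as a lookup from file extension to a decoding function. The following is only a minimal sketch of that idea for the text and JSON cases. The `DECODERS` table and `decode_entry` helper are hypothetical names made up for illustration and are not part of energon or WebDataset.

```python
import json

# Hypothetical extension-to-decoder table (illustration only). The real
# decoders also cover images, video, audio, .pth and .npy entries.
DECODERS = {
    "txt": lambda raw: raw.decode("utf-8"),
    "json": lambda raw: json.loads(raw.decode("utf-8")),
}


def decode_entry(extension: str, raw: bytes):
    """Decode one tar entry, or return the raw bytes if no decoder matches."""
    decoder = DECODERS.get(extension)
    return decoder(raw) if decoder is not None else raw


# One sample with three entries, keyed by extension.
sample = {"txt": b"a cat", "json": b'{"caption": "a cat"}', "bin": b"\x00\x01"}
decoded = {ext: decode_entry(ext, raw) for ext, raw in sample.items()}
print(decoded)
```

Entries without a matching decoder are passed through unchanged in this sketch, which mirrors the idea of receiving raw bytes when decoding is disabled.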
+
+## How to control data decoding
+
+Starting with Energon 7.0.0, the new way to configure data decoding is to specify a `decoder` as a class variable of your [Task Encoder](../basic/task_encoder):
+```python
+class MyTaskEncoder(DefaultTaskEncoder):
+    decoder = SampleDecoder(image_decode="pilrgb")
+```
+
+Typically, you will just instantiate a {py:class}`SampleDecoder <megatron.energon.SampleDecoder>` and provide the arguments to configure it, as shown above.
+If you do **not** want automatic decoding at all, you have to explicitly set `decoder = None` in your TaskEncoder.
+
+Here are the different options you can pass to {py:class}`SampleDecoder <megatron.energon.SampleDecoder>`:
+
+* `image_decode` (str)
+    * Can be set to an image decoder from webdataset. Here are some examples:
+        * `pil`: Returns the image as a PIL image
+        * `torchrgb`: Returns the image as a torch tensor with 3 color channels.
+    * For more options, check out the [official documentation](https://rom1504.github.io/webdataset/api/webdataset/autodecode.html#ImageHandler).
+* `av_decode` (str)
+    * Can be one of `AVDecoder`, `torch`, `pyav`. The default is `AVDecoder` which is explained [below](av-decoder).
+    * The option `torch` would decode video and audio entirely and return them as tensors.
+    * The `pyav` option is for advanced use cases where you need direct access to the object returned by `av.open()`
+* `video_decode_audio` (bool)
+    * If `True`, videos that have an audio track will decode both the video and the audio. Otherwise, only the video frames are decoded.
+* `guess_content` (bool)
+    * *New in Energon 7.0.0*
+    * Whether to guess the contents of the file using the `filetype` package. Useful if you have files without extensions in your data.
+
+### Legacy method before Energon 7.0.0
+
+
+```{warning}
+The below method of configuring auto decoding was deprecated in Energon 7.0.0. Please migrate to the above new method with `SampleDecoder`.
+```
+
+In older versions of Energon, you could pass arguments when calling {py:func}`get_train_dataset <megatron.energon.get_train_dataset>` or {py:func}`get_val_dataset <megatron.energon.get_val_dataset>`.
+The arguments are more or less identical to what can be passed to `SampleDecoder` above, except:
+
+* `auto_decode` (bool)
+    * Set to `False` to disable all automatic decoding of the data. In your sample loader, you will then get raw bytes. The default setting is `True`
+    * Setting to `False` is equivalent to setting `decoder = None` in the new version.
+* `guess_content` (bool)
+    * Not available in older versions
+
+
+(av-decoder)=
+## AV Decoder (Audio and Video)
+
+Energon comes with code to efficiently decode compressed video and audio files such as MP4, MP3 or WAV.
+It integrates a library called `fastseek` that allows sub-clips to be extracted quickly from longer videos.
+The actual decoding is done by PyAV which in turn relies on ffmpeg.
+
+When choosing `AVDecoder` as the option for `av_decode` above, an object of the type {py:class}`AVDecoder <megatron.energon.av.AVDecoder>` (click to see methods) will be returned. 
+At this point, the file has not yet been decoded, but in your [custom sample loader](custom-sample-loader) or in your [TaskEncoder](../basic/task_encoder), you can read parts or all of the file using the methods of `AVDecoder`.
+
+For example, to extract the first 2 seconds of a video and the 2 seconds starting at the 4-second mark:
+```py
+dat = av_decoder.get_clips(
+    video_clip_ranges=[(0, 2), (4, 6)],
+    video_unit="seconds",
+)
+
+# Print both clips' tensor shapes
+print(dat.video_clips[0].shape)
+print(dat.video_clips[1].shape)
+```
+The return value of {py:meth}`get_clips <megatron.energon.av.AVDecoder.get_clips>` will be an object of type {py:class}`AVData <megatron.energon.av.AVData>`. It may contain video or audio tensors or both, depending on what you requested. It will also contain the timestamps of the actually returned clips.
+
+Click on {py:class}`AVData <megatron.energon.av.AVData>` to see the fields and the shapes of the tensors.
+
+
+You can also use helper methods to extract multiple clips at equal distances all at once:
+```py
+from megatron.energon.av import get_clips_uniform
+
+dat = get_clips_uniform(
+    av_decoder=av_decoder, num_clips=5, clip_duration_seconds=3, request_audio=True
+)
+```
+This would extract 5 clips, each 3 seconds long, including the corresponding audio clips.
+The first clip starts at position 0 and the last clip ends at the end of the video.
+This is essentially just a convenience wrapper around `get_clips`, so its code may be a good starting point if you want to write custom extraction logic.
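
To make the spacing rule concrete, here is a minimal sketch of how evenly spaced clip ranges could be computed by hand. It is not the implementation of `get_clips_uniform`. The helper name `uniform_clip_ranges` and the 23-second video length are assumptions for illustration, and details such as rounding to frame boundaries are ignored.

```python
def uniform_clip_ranges(total_seconds, num_clips, clip_seconds):
    """Return (start, end) ranges in seconds, evenly spaced over the video."""
    if num_clips < 1 or clip_seconds > total_seconds:
        raise ValueError("clips do not fit into the video")
    if num_clips == 1:
        return [(0.0, float(clip_seconds))]
    # Distance between consecutive clip starts, chosen so that the first
    # clip starts at 0 and the last clip ends exactly at the end.
    step = (total_seconds - clip_seconds) / (num_clips - 1)
    return [(i * step, i * step + clip_seconds) for i in range(num_clips)]


# Assumed example: a 23 second video, 5 clips of 3 seconds each.
ranges = uniform_clip_ranges(total_seconds=23, num_clips=5, clip_seconds=3)
print(ranges)
```

A list of `(start, end)` tuples like this has the same form as the `video_clip_ranges` argument used with `video_unit="seconds"` in the earlier example.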
+
+We also provide a method {py:func}`get_single_frames_uniform <megatron.energon.av.get_single_frames_uniform>` which will return a tensor of frames directly instead of an `AVData` object.
+
+
+The simplest case is to decode the whole video or audio or both:
+```py
+dat = av_decoder.get_video()
+
+# or
+dat = av_decoder.get_audio()
+
+# or
+dat = av_decoder.get_video_with_audio()
+```
\ No newline at end of file
--- a/Megatron-Energon/docs/source/basic/data_prep.md
+++ b/Megatron-Energon/docs/source/basic/data_prep.md
--- a/Megatron-Energon/docs/source/basic/glossary.md
+++ b/Megatron-Energon/docs/source/basic/glossary.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Glossary
+
+* **Batch Grouping**
+    * Allows you to programmatically decide which samples (out of a buffer) will be put into one batch. See [](../advanced/grouping.md).
+* **Cooking**
+    * Used to transform crude (raw) samples into a populated instance of a sample data class.
+* **Crude Dataset**
+    * An energon dataset that does not yield a readily-populated sample (instance of dataclass), but a raw dict.
+    * A cooker is used to handle this transformation in the user's custom task encoder. See [](crude-data).
+* **Grouping**
+    * See "Batch Grouping"
+* **Monolithic Dataset**
+    * The simple form of putting all your text and media data into the same WebDataset (see [](monolithic-dataset)).
+    * The other option is to use a "Polylithic Dataset"
+* **Packing**
+    * For Energon, with "packing" we mean "sequence packing". See "Sequence Packing" below.
+* **Polylithic Dataset**
+    * Used to split the text-based data from the (usually larger) media data.
+    * Each modality will be put in its own dataset and one dataset can refer to the other by file names.
+    * For more information see [](polylithic-dataset)
+* **Sample**
+    * In Energon, by sample we typically mean an instance of {py:class}`Sample <megatron.energon.Sample>` (e.g. one of its subclasses)
+    * Sometimes we also call the source files that are inside the WebDataset and are used to create that dataclass instance a "sample"
+        * For example inside one tar file there may be `004.jpg` and `004.txt` (image and label) together forming a captioning sample
+    * The {py:class}`Sample <megatron.energon.Sample>` dataclass has several mandatory and optional fields that describe one piece of training data for your ML workload. Typically it contains the input data to the model and the label data.
+* **Sample Part**
+    * A "sample part" is one of the components of a sample inside the WebDataset tar file. A captioning sample may be created from `004.jpg` and `004.txt` and each of those files is a sample part. This sample with the *key* `004` has two *parts* `txt` and `jpg`.
+* **Sequence Packing**
+    * A method to better utilize the available context length / sequence length of a model and reduce padding.
+    * Explained in [](../advanced/packing.md)
+* **Task Encoder**
+    * An Energon-specific concept: The TaskEncoder is a user-defined class to customize the steps of the data flow pipeline.
+    * See [](../basic/basics_flow.md) and [](../basic/task_encoder.md)
+* **WebDataset**
+    * A file format to store your dataset on disk, based on TAR files. See [https://github.com/webdataset/webdataset](https://github.com/webdataset/webdataset).
+    * Energon's dataset format builds on WebDataset and extends it with additional files, see [](data-on-disk).
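
The relation between a sample *key* and its *parts* (see "Sample Part" above) can be illustrated with a few lines of plain Python. This is a sketch only: the helpers `split_entry_name` and `group_parts` are hypothetical, and the rule of splitting at the first dot of the file name is assumed here to follow the usual WebDataset convention.

```python
from collections import defaultdict


def split_entry_name(name: str):
    """Split 'dir/004.jpg' into the key 'dir/004' and the part 'jpg'."""
    directory, _, filename = name.rpartition("/")
    # Assumption: the key ends at the first dot of the file name.
    stem, _, part = filename.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, part


def group_parts(names):
    """Group tar entry names into {key: {part: entry_name}}."""
    samples = defaultdict(dict)
    for name in names:
        key, part = split_entry_name(name)
        samples[key][part] = name
    return dict(samples)


grouped = group_parts(["004.jpg", "004.txt", "005.jpg", "005.txt"])
print(grouped)
```

Each key in `grouped` corresponds to one sample, and each inner entry to one sample part.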
--- a/Megatron-Energon/docs/source/basic/metadataset.md
+++ b/Megatron-Energon/docs/source/basic/metadataset.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Metadataset
+
+Metadatasets allow combining datasets together in a variety of ways by using a yaml config file.
+This is useful for example if you want to mix multiple datasets together, and especially if you want to reuse that combination.
+
+To create a metadataset, you simply create a yaml file of the following format.
+Example `demo-metadataset.yaml` file:
+
+```yaml
+__module__: megatron.energon
+__class__: MetadatasetV2
+splits:
+  # Split defines the train/val/test subsets composition 
+  train:
+    # Mix the following datasets
+    blend:
+      - weight: 5
+        path: ./coco
+      - weight: 2
+        path: ./coyo
+      - weight: 1
+        path: ./other
+  val:
+    # Only use coco-val for val
+    path: ./coco
+  test:
+    # Only use coyo-test for test
+    path: ./coyo
+```
+
+
+In the above example, we create a blend of three datasets. Out of the yielded training samples, 62.5% ({math}`=\frac{5}{8}`) will come from `./coco`, 25% from `./coyo` and 12.5% from `./other`.
+Note that the relative paths in the metadataset are relative to the location of the metadataset file. Absolute paths are allowed but won't work for object storage.
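
The percentages follow directly from normalizing the weights. As a quick check, the sketch below recomputes them and draws a few thousand dataset choices with the standard library. It only illustrates the arithmetic and is not a description of how energon implements blending internally.

```python
import random
from collections import Counter
from fractions import Fraction

# Weights copied from the metadataset example above.
weights = {"./coco": 5, "./coyo": 2, "./other": 1}
total = sum(weights.values())

# Exact share of each dataset: weight divided by the sum of all weights.
shares = {path: Fraction(w, total) for path, w in weights.items()}
for path, share in shares.items():
    print(f"{path}: {float(share):.1%}")

# Drawing many samples approaches those shares (seeded for repeatability).
rng = random.Random(0)
picks = rng.choices(list(weights), weights=list(weights.values()), k=8000)
counts = Counter(picks)
print({path: counts[path] / len(picks) for path in weights})
```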
+
+To use the metadataset in your loader, simply load it with {py:func}`get_train_dataset <megatron.energon.get_train_dataset>` instead of a normal energon dataset:
+```python
+from megatron.energon import get_train_dataset
+
+ds = get_train_dataset(
+    'demo-metadataset.yaml',
+    batch_size=4,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+)
+
+for batch in ds:
+    print(batch)
+    break
+
+```
+
+Here is another example that takes both the training and the validation set of coyo into the blended training data (with different weights though):
+
+```yaml
+__module__: megatron.energon
+__class__: MetadatasetV2
+splits:
+  # Split defines the train/val/test subsets composition 
+  train:
+    # Mix the following datasets
+    blend:
+      - weight: 5
+        path: ./coco
+      - weight: 2
+        path: ./coyo
+        split_part: train
+      - weight: 1
+        path: ./coyo
+        split_part: val  # <-- Takes the val set of coyo into the train split
+  val:
+    # Only use coco-val for val
+    path: ./coco
+  test:
+    # Only use coyo-test for test
+    path: ./coyo
+```
+
+Actually, `split_part: train` is the default here, so there's no need to specify it explicitly.
+Likewise, when referring to datasets under `val:`, `split_part: val` is the default.
+
+Energon also supports blending by specifying the number of repetitions for each dataset using [Epochized Blending](../advanced/epochized_blending).
+
+(sect-subflavors)=
+## Subflavors
+
+Subflavors are a way to *tag* samples that come from different origins so that they can still be differentiated after blending.
+Even when blending many datasets together, you might want to handle some of them differently in your [Task Encoder](task_encoder).
+For example when doing OCR, you might have one dataset with full pages of text and one with only paragraphs. In your task encoder you could decide to augment the images differently.
+
+Here is a modified version of the above metadataset config file that adds some subflavors:
+```yaml
+__module__: megatron.energon
+__class__: MetadatasetV2
+splits:
+  # Split defines the train/val/test subsets composition 
+  train:
+    # Blend the following datasets
+    blend:
+      - weight: 5
+        path: ./coco
+        # Set the __subflavors__ property of the samples
+        subflavors:
+          augmentation_type: small_images
+          text_length: short
+      # Combine coyo-train and coyo-val
+      - weight: 2
+        path: ./coyo
+        split_part: train
+        # Set the __subflavors__ property of the samples
+        subflavors:
+          augmentation_type: large_images
+          text_length: short
+      - weight: 1
+        path: ./coyo
+        split_part: val
+        # Set the __subflavors__ property of the samples
+        subflavors:
+          augmentation_type: large_images
+          text_length: short
+  # For val and test, blending will actually concatenate the datasets
+  val:
+    # Only use coco val for val
+    path: ./coco
+    subflavors:
+      augmentation_type: small_images
+      text_length: short
+  test:
+    path: ./coyo
+```
+
+In the above example, the coco training samples will now have the subflavor `augmentation_type` set to `small_images`, while the samples from coyo will have that property set to `large_images`.
+
+Note that subflavors are entirely custom and you can use any name and any value for them, for example `foo: bar`.
+In the code, they will be passed around as a dictionary.
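
Since subflavors arrive as a plain dictionary, branching on them is ordinary dictionary handling. The sketch below shows one possible policy. The function `pick_augmentation` and the augmentation names it returns are hypothetical, and in a real task encoder the dictionary would be read from the sample rather than written out by hand.

```python
def pick_augmentation(subflavors: dict) -> str:
    """Choose an augmentation name from a sample's subflavors (made-up policy)."""
    kind = subflavors.get("augmentation_type")
    if kind == "small_images":
        return "upscale_then_crop"
    if kind == "large_images":
        return "random_crop"
    # Samples without the tag are left untouched in this sketch.
    return "none"


# Dictionaries matching the subflavors set in the yaml example above.
coco_subflavors = {"augmentation_type": "small_images", "text_length": "short"}
coyo_subflavors = {"augmentation_type": "large_images", "text_length": "short"}

print(pick_augmentation(coco_subflavors))
print(pick_augmentation(coyo_subflavors))
print(pick_augmentation({}))
```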
+
+## Auxiliary Data
+
+In the metadataset, you can also specify auxiliary data for each blended dataset.
+Please check out [Auxiliary Data](aux-data) for more information.
+
+## Classes
+* {py:class}`DatasetLoaderInterface <megatron.energon.DatasetLoaderInterface>`: Common interface for dataset loaders. Provides methods for constructing/loading the actual train- or val-mode dataset.
+  * {py:class}`MetadatasetV2 <megatron.energon.MetadatasetV2>`: The metadataset loader using the yaml example above. Blends datasets for train-mode, and concatenates for val-mode.
+  * {py:class}`DatasetLoader <megatron.energon.DatasetLoader>`: The dataset loader using a dataprepped folder (containing `.nv-meta` folder).
+
+## Functions
+* {py:func}`get_train_dataset <megatron.energon.get_train_dataset>`: Returns the train-mode (meta)dataset.
+* {py:func}`get_val_dataset <megatron.energon.get_val_dataset>`: Returns the val-mode (meta)dataset.
--- a/Megatron-Energon/docs/source/basic/quickstart.md
+++ b/Megatron-Energon/docs/source/basic/quickstart.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Quickstart
+
+You can use Megatron Energon to load datasets in the energon dataset format. This format is essentially [WebDataset](https://github.com/webdataset/webdataset) with some metadata added.
+Since version 7.2.0, we also support [JSONL](create-jsonl-dataset) as a simpler format with fewer features.
+
+For a moment let's assume you already have prepared a dataset in the needed format, and it's stored on
+disk at `/my/dataset/path`. If you want to create a dataset now, check out [](data_prep).
+
+If you simply want to use some dummy data for trying this out, check out the unit test method `create_captioning_test_dataset` inside `tests/test_dataset.py`.
+
+```{admonition} Good to know
+:class: tip
+You can also store your dataset inside an S3-compatible object store and load it from there! See [](../advanced/remote_dataset)
+```
+
+You can then load the dataset like this:
+```python
+from megatron.energon import get_train_dataset, get_loader, WorkerConfig
+
+ds = get_train_dataset(
+    '/my/dataset/path',
+    batch_size=1,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=WorkerConfig.default_worker_config(),
+)
+
+loader = get_loader(ds)
+
+for batch in loader:
+    # Do something with batch
+    # Infer, gradient step, ...
+    pass
+```
+
+First, we call {py:meth}`get_train_dataset <megatron.energon.get_train_dataset>` (click to see signature).
+The function will check what kind of dataset is on disk and instantiate the correct class for it.
+
+A worker configuration is always needed to specify how the work is distributed across multiple ranks and workers.
+In this simple example, we use a helper method {py:meth}`default_worker_config <megatron.energon.WorkerConfig.default_worker_config>` to get reasonable default values.
+
+The dataset should not be iterated directly, but used with a loader which handles the worker processes.
+The batches will contain samples of the sample type specified in the [task encoder](task_encoder).
+
+```{admonition} Good to know
+:class: tip
+Since we did not specify a task encoder above, the {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>` will be used.
+It will not transform the data. For batching it will use common sense magic to pad and stack tensors or build lists if the type is unknown.
+```
+
+_Wait. Why does the dataset create batches? Shouldn't the dataloader do that?_
+
+Energon will create batches at dataset level.
+Internally, most of the cool things that energon can do (such as blending datasets together, [sequence packing](../advanced/packing), etc.)
+are dataset wrappers. Even the process of batching is such a wrapper and the default {py:meth}`get_train_dataset <megatron.energon.get_train_dataset>`
+function will construct a suitable combination of all these based on the arguments you pass to that function.
+Check out the [](basics_flow) section to see the steps in which the data is processed.
+
+_Why must `shuffle_buffer_size` and `max_samples_per_sequence` be set explicitly?_
+
+As the library is designed to work on (sequential) webdatasets but still wants to provide proper shuffling, these parameters are required. To make sure the user does not forget them, we require that they are set explicitly.
+A value of 100 for both settings seems to work well for image datasets (i.e. it balances shuffling randomness against the performance impact of seeking), but datasets where the samples are much larger or smaller might require different settings.
+Setting the sequence length to a very small size compared to the number of samples in the dataset will result in more random access, thus slowing down dataloading, so the recommendation is to set it to a high enough value.
+At the same time, a high value reduces the shuffling randomness, which requires a larger shuffle buffer size to compensate for that (i.e. higher memory footprint and longer state restore times).
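
To see why a larger shuffle buffer buys more randomness at the cost of memory, consider the generic shuffle-buffer technique on a sequential stream. This is a minimal sketch of that general idea, not energon's implementation. The function name `shuffle_buffer` is an assumption, and `max_samples_per_sequence` is not modeled here.

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Yield items of a sequential stream in locally shuffled order."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Buffer is full: swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: emit the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(20), buffer_size=5))
print(shuffled)
```

In this sketch an item can only move forward in the output by at most `buffer_size` positions, so the amount of mixing is bounded by the number of samples held in memory.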
+
+## Tutorial 1: Preparing the Dataset
+
+For running any tutorials, you'll need your data structured as webdataset. For more details and more options check out [](data_prep).
+
+For example, in your folder containing the tar files, run
+```sh
+$ energon prepare ./
+
+# Example answers to interactive questions:
+Ratio: 8,1,1
+Dataset class: CaptioningWebdataset
+Field map: Yes
+  image: jpg
+  caption: txt  # if txt contains the caption
+# or
+  caption: json[caption]  # if .json contains {"caption": "My nice image"}
+```
+
+
+## Tutorial 2: Loading a Dataset
+
+Let's be a bit more concrete and try out the above data loading code with a real dataset.
+We are going to print the first batch and stop.
+
+```python
+from megatron.energon import get_train_dataset, get_loader, WorkerConfig
+
+ds = get_train_dataset(
+    '/path/to/your/dataset',
+    batch_size=None,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=WorkerConfig.default_worker_config(),
+)
+
+loader = get_loader(ds)
+
+for batch in loader:
+    print(batch)
+    break
+```
+
+This prints
+
+```text
+CaptioningSample(__key__='part_00123/00403.tar/004030195', image=tensor([[[0.4549, 0.4784, 0.5059,  ..., 0.8392, 0.8275, 0.7961],
+          [0.4549, 0.4784, 0.5020,  ..., 0.6431, 0.6275, 0.5882],
+          [0.4510, 0.4706, 0.4941,  ..., 0.6235, 0.6353, 0.6078],
+          ...,
+          [0.4471, 0.4196, 0.4510,  ..., 0.8471, 0.8039, 0.8275],
+          [0.4667, 0.4353, 0.4667,  ..., 0.8196, 0.7804, 0.8078],
+          [0.4824, 0.4549, 0.4824,  ..., 0.8196, 0.7843, 0.8118]],
+
+         [[0.3608, 0.3843, 0.4118,  ..., 0.7373, 0.7255, 0.6941],
+          [0.3608, 0.3843, 0.4078,  ..., 0.5412, 0.5255, 0.4863],
+          [0.3569, 0.3765, 0.4000,  ..., 0.5098, 0.5216, 0.4941],
+          ...,
+          [0.3608, 0.3333, 0.3647,  ..., 0.7529, 0.7098, 0.7333],
+          [0.3804, 0.3490, 0.3804,  ..., 0.7255, 0.6863, 0.7137],
+          [0.3961, 0.3686, 0.3961,  ..., 0.7255, 0.6902, 0.7176]],
+
+         [[0.2510, 0.2745, 0.3020,  ..., 0.6000, 0.5882, 0.5569],
+          [0.2510, 0.2745, 0.2980,  ..., 0.4039, 0.3882, 0.3490],
+          [0.2471, 0.2667, 0.2902,  ..., 0.3765, 0.3882, 0.3608],
+          ...,
+          [0.2667, 0.2392, 0.2706,  ..., 0.6510, 0.6000, 0.6235],
+          [0.2863, 0.2549, 0.2863,  ..., 0.6235, 0.5765, 0.6039],
+          [0.3020, 0.2745, 0.3020,  ..., 0.6235, 0.5882, 0.6157]]]), caption='Cello Renting vs. Buying: Which is Right for You?')
+```
+
+Awesome, it returns a {py:class}`CaptioningSample <megatron.energon.CaptioningSample>` with the attributes
+
+- `__key__`: `part_00123/00403.tar/004030195`, the identifier of the sample like `TAR_FILE/INDEX`
+  - All sample types will have a key. It's in the base class {py:class}`Sample <megatron.energon.Sample>` 
+- `image`: The image as a tensor of shape `(3, 267, 400)` (an RGB image without a batch dimension, since `batch_size=None`)
+- `caption`: The caption as a single string (not a list, since the sample is not batched)
+
+Let's also talk about the {py:class}`WorkerConfig <megatron.energon.WorkerConfig>`. As energon is made for distributed training,
+you always need to provide a worker config to the dataset to specify how many ranks and workers there are and which rank you're currently on.
+For this simple tutorial, we don't really distribute the work, so we use only a single rank with 4 workers. Check out the helper method {py:meth}`default_worker_config <megatron.energon.WorkerConfig.default_worker_config>` to see how the worker config is constructed. Also don't be afraid to click the *`[source]`* link and look at the very short source code of it.
+
+## Tutorial 3: Batch Size
+
+Actually, we would like to use a `batch_size` of more than one, let's go with 2 for now.
+
+```python
+from megatron.energon import get_train_dataset, get_loader, WorkerConfig
+
+loader = get_loader(get_train_dataset(
+    '/path/to/your/dataset',
+    batch_size=2,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=WorkerConfig.default_worker_config(),
+))
+
+for batch in loader:
+    print(batch)
+    break
+```
+
+The output will be similar to above but with different shapes and lengths:
+
+- `batch.__key__`: A list of two keys
+- `batch.image`: Tensor of shape `(2, 3, 267, 400)`
+- `batch.caption`: A list of two caption strings
+
+The default [task encoder](task_encoder) automagically padded and stacked the items to a batch.
+This may be ok for some cases, but usually you will want to process and batch your data differently.
+
+Hence, we can
+
+- either use an existing task encoder
+- or define a custom one (see [](task_encoder))
+
+## Tutorial 3: Blending using Metadataset
+
+A typical use case is to blend multiple datasets of the same (or a similar) type together.
+For example, you may want to blend the COCO dataset with the COYO dataset.
+The easiest way to do this is to use the metadataset pattern.
+For this, you need to create a new `yaml` file that defines the metadataset:
+
+`coyo-coco-dataset.yaml`:
+```yaml
+__module__: megatron.energon
+__class__: MetadatasetV2
+splits:
+  # Train dataset, the datasets will be blended according to their weights 
+  train:
+    blend:
+      - weight: 5
+        path: ./coco
+      - weight: 2
+        path: ./coyo
+  # For val and test, datasets will be concatenated
+  val:
+    path: ./coco
+  test:
+    path: ./coyo
+```
+
+This assumes that the datasets `coyo` and `coco` exist in subfolders next to the `coyo-coco-dataset.yaml` file. You could also use absolute paths, but that will not work well when using object storage such as S3.
+
+To use it in your loader, simply call {py:func}`get_train_dataset <megatron.energon.get_train_dataset>`:
+```python
+from megatron.energon import get_train_dataset, get_loader, WorkerConfig
+
+loader = get_loader(get_train_dataset(
+    'coyo-coco-dataset.yaml',
+    batch_size=4,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=WorkerConfig.default_worker_config(),
+))
+
+for batch in loader:
+    print(batch)
+    break
+
+```
+
+If you need to handle samples from different datasets differently in your pipeline, you will want to use `subflavors`.
+For these and other details, check out the [](metadataset) section. Energon also supports blending by specifying the number of repetitions for each dataset using [](../advanced/epochized_blending).
+
+## Tutorial 4: Distributed Loading
+
+For multi-GPU support, you may need to adapt the worker config.
+So far we have only used the default worker config, which you can get by calling {py:func}`WorkerConfig.default_worker_config() <megatron.energon.WorkerConfig.default_worker_config>`.
+This default config tries to infer your multi-GPU setup by using `torch.distributed`, which is fine in most cases.
+If you are not using any distributed setup, the default config will work, too. In that case, it assumes a single local rank.
+
+However, if you have a more complex multi-node setup with other non-data-parallel strategies, you may need to set it up yourself.
+The following example shows how it could be set.
+
+```python
+from megatron.energon import get_train_dataset, get_loader, WorkerConfig
+import torch.distributed as dist
+
+worker_config = WorkerConfig(
+    rank=SET_YOUR_GLOBAL_DATA_RANK_HERE,
+    world_size=SET_YOUR_GLOBAL_WORLD_SIZE_HERE,
+    num_workers=2,
+)
+
+loader = get_loader(get_train_dataset(
+    'coyo-coco-dataset.yaml',
+    batch_size=4,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=worker_config,
+))
+
+for batch in loader:
+    print(batch)
+    break
+```
+
+For a detailed discussion on parallelism, check out [](../advanced/parallelism).
+
+## Tutorial 5: Saving and Restoring
+
+To save and restore the state (e.g. when pausing and resuming training), the loader must be created with `get_savable_loader` instead of `get_loader`.
+
+```python
+from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig
+
+ds = get_train_dataset(
+    'coyo-coco-dataset.yaml',
+    batch_size=4,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=WorkerConfig.default_worker_config(),
+)
+
+# Must use the savable loader here. This provides methods to save
+# and load the state of the data loader
+loader = get_savable_loader(ds)
+
+for i, batch in zip(range(10), loader):
+    print(batch)
+    break
+
+# Save the state
+state = loader.save_state_rank()
+# Could save the state now using torch.save()
+
+# ... when loading:
+# Could load the state with torch.load()
+
+# Restore the state for a new loader
+ds = get_train_dataset(
+    'coyo-coco-dataset.yaml',
+    batch_size=4,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    worker_config=WorkerConfig.default_worker_config(),
+)
+loader = get_savable_loader(ds)
+loader.restore_state_rank(state)
+```
+
+The section [](save_restore) provides code for different saving and loading scenarios, especially for distributed settings.
+
+## More Features
+
+Check out the topics in Advanced Usage for details on specific features.
--- a/Megatron-Energon/docs/source/basic/save_restore.md
+++ b/Megatron-Energon/docs/source/basic/save_restore.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Save and Restore
+
+For long-running training jobs, you will usually need to stop and resume the training including the data loader.
+One of energon's unique features is the deterministic save and restore capability.
+
+At any iteration, you'll be able to store the overall state of the data loader across all ranks and accurately resume it later on, to continue where it left off.
+Below, we list a few different ways to achieve that.
+
+## Scenarios
+
+All these scenarios work in distributed and non-distributed settings.
+
+For simplicity, we are assuming the same loader and data for all scenarios from the [](../basic/quickstart):
+```python
+import torch
+from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig
+
+worker_config = WorkerConfig.default_worker_config()
+
+def get_my_loader():
+    return get_savable_loader(get_train_dataset(
+        'coyo-coco-dataset.yaml',
+        batch_size=4,
+        shuffle_buffer_size=100,
+        max_samples_per_sequence=100,
+        worker_config=worker_config,
+    ))
+
+```
+
+### 1. Save/Restore the State per Rank Separately
+
+In this scenario, each rank saves and restores its own state in an independent file.
+This is our recommended way, since it avoids transferring the data across ranks.
+
+```python
+# Saving the state
+loader = get_my_loader()
+
+# Iterate for some steps
+for i, batch in zip(range(10), loader):
+    print(batch)
+    break
+
+# Save the state
+state = loader.save_state_rank()
+# Save the state on each rank
+# In this example, the state is saved with `torch.save`; any other serialization works, too
+torch.save(state, f'dataloader_state_rank{worker_config.rank}.pth')
+```
+
+```python
+# Restoring the state
+loader = get_my_loader()
+
+# Now, when restoring the state:
+state = torch.load(f'dataloader_state_rank{worker_config.rank}.pth')
+
+# Restore the state for the loader on each rank separately
+loader.restore_state_rank(state)
+```
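
The example above notes that the way the state is written to disk can be custom. As a minimal sketch of one such custom approach, the following uses only the standard library and writes each rank's file atomically, so that an interrupted job does not leave a half-written checkpoint. The helper names (`rank_state_path`, `save_rank_state`, `load_rank_state`) and the dictionary used as a state are hypothetical stand-ins; the sketch assumes the real state object returned by `save_state_rank()` can be pickled.

```python
import os
import pickle
import tempfile
from typing import Any


def rank_state_path(directory: str, rank: int) -> str:
    """Build the per-rank checkpoint file name (hypothetical naming scheme)."""
    return os.path.join(directory, f"dataloader_state_rank{rank}.pkl")


def save_rank_state(state: Any, directory: str, rank: int) -> str:
    """Write the state to a temporary file, then atomically move it into place."""
    os.makedirs(directory, exist_ok=True)
    final_path = rank_state_path(directory, rank)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        # os.replace is atomic when source and target are on the same filesystem
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def load_rank_state(directory: str, rank: int) -> Any:
    """Read back the state that was saved for the given rank."""
    with open(rank_state_path(directory, rank), "rb") as f:
        return pickle.load(f)


# Stand-in for two ranks saving and restoring independently
with tempfile.TemporaryDirectory() as checkpoint_dir:
    for rank in range(2):
        save_rank_state({"rank": rank, "position": 10}, checkpoint_dir, rank)
    restored = [load_rank_state(checkpoint_dir, rank) for rank in range(2)]
```

In a real job, each rank would call `save_rank_state(loader.save_state_rank(), ...)` with its own rank and later pass the loaded object to `loader.restore_state_rank(...)`.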
+
+
+### 2. Save/Restore the State on the Primary Rank Only
+
+In this scenario, the primary rank (usually rank 0) is responsible for saving the state.
+All ranks' states are collected (gathered) by one rank and can be stored in one file.
+When restoring, the state is scattered from the primary rank to all other ranks.
+This approach centralizes the state management, which can simplify the process and reduce the number of files stored.
+
+```python
+# Saving the state
+loader = get_my_loader()
+
+# Iterate for some steps
+for i, batch in zip(range(10), loader):
+    print(batch)
+
+# Save the state to primary rank 0
+state = loader.save_state_global(dst_rank=0)
+if worker_config.rank == 0:
+    # Only rank 0 has the state now; on all other ranks, the state is None
+    # This example uses `torch.save`, but any serialization method works
+    torch.save(state, 'dataloader_state.pth')
+```
+
+```python
+# Restoring the state
+loader = get_my_loader()
+
+# Load the state only on the primary rank
+if worker_config.rank == 0:
+    state = torch.load('dataloader_state.pth')
+else:
+    state = None
+
+# Restore the state for the loader, broadcasting from rank 0
+loader.restore_state_global(state, src_rank=0)
+```
+
+
+```{admonition} Note
+:class: important
+Even though only one rank collects the states, all ranks need to execute the `loader.save_state_global()` and `loader.restore_state_global()` calls.
+```
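
To make the gather and scatter pattern of this scenario more concrete, here is a small pure-Python simulation. It does not use `torch.distributed` or energon; the functions `simulate_gather` and `simulate_scatter` and the dictionary-shaped states are hypothetical and only model which rank holds which data after each step.

```python
from typing import Any, List, Optional


def simulate_gather(per_rank_states: List[Any], dst_rank: int) -> List[Optional[List[Any]]]:
    """Return what every rank holds after all states were gathered on dst_rank."""
    world_size = len(per_rank_states)
    return [list(per_rank_states) if rank == dst_rank else None for rank in range(world_size)]


def simulate_scatter(held: List[Optional[List[Any]]], src_rank: int) -> List[Any]:
    """Return the state each rank receives when src_rank distributes its list."""
    all_states = held[src_rank]
    if all_states is None:
        raise ValueError("src_rank does not hold the gathered states")
    return [all_states[rank] for rank in range(len(held))]


# Four simulated ranks, each with its own (made-up) loader state
states = [{"rank": r, "sample_index": 40 + r} for r in range(4)]

gathered = simulate_gather(states, dst_rank=0)
restored = simulate_scatter(gathered, src_rank=0)
```

After the gather, only the entry for rank 0 is populated and all others are `None`, which mirrors the `if worker_config.rank == 0` branches in the example above. Because every rank contributes to the gather and receives from the scatter, every rank has to take part in both calls.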
+
+### 3. Save the State on the Primary Rank, Restore on Ranks Separately
+
+In this scenario, the primary rank saves the state, but each rank restores the state separately. Each rank loads all saved states and selects the correct one. This approach combines centralized saving with distributed restoring and is rather uncommon.
+
+Depending on the framework used for training, that framework may already handle the scattering/gathering of the states. In that case, refer to the first scenario using `save_state_rank`/`restore_state_rank`.
+
+```python
+# Saving the state
+loader = get_my_loader()
+
+# Iterate for some steps
+for i, batch in zip(range(10), loader):
+    print(batch)
+
+# Save the state
+state = loader.save_state_global(dst_rank=0)
+if worker_config.rank == 0:
+    # This example uses `torch.save`, but any serialization method works
+    torch.save(state, 'dataloader_state.pth')
+```
+
+```python
+# Restoring the state
+loader = get_my_loader()
+
+# Load on all ranks
+state = torch.load('dataloader_state.pth')
+
+# Restore the state for the loader on the current rank from the checkpoint that contains all ranks' states
+loader.restore_state_global(state, src_rank=None)
+```
+
+## Summary
+
+In each of these scenarios, ensure that the logic for saving and restoring the state is appropriately synchronized across ranks to maintain consistency.
+If you encounter torch distributed errors, the torch distributed calls are most likely out of sync across ranks, or not all ranks execute the save/restore calls. If unsure, debug using the first scenario, which saves each rank separately.
--- a/Megatron-Energon/docs/source/basic/task_encoder.md
+++ b/Megatron-Energon/docs/source/basic/task_encoder.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Task Encoder
+
+The "Task Encoder" is an Energon-specific concept. It is a class that describes how each sample is going to be processed at different stages of the pipeline.
+Please also take a look at [](basics_flow) to see the pipeline.
+
+If you don't specify a task encoder, the default {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>` is used.
+
+For writing your own task encoder, create a class based on {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>`
+and override one or more of the following methods. The data flow of {py:func}`get_train_dataset <megatron.energon.get_train_dataset>` or {py:func}`get_val_dataset <megatron.energon.get_val_dataset>` is as follows:
+
+- {py:meth}`def cook_crude_sample(self, sample: Union[T_sample, CrudeSample]) -> T_sample <megatron.energon.TaskEncoder.cook_crude_sample>`
+  - Optional. Define when using [crude data](crude-data).
+- {py:func}`def encode_sample(self, sample: T_sample) -> T_encoded_sample <megatron.energon.DefaultTaskEncoder.encode_sample>`
+  - Transform the raw data from the dataset (e.g. augment/transform images, tokenize a single sample).
+  - Alternatively, you can use {py:meth}`preencode_sample <megatron.energon.TaskEncoder.preencode_sample>` and {py:meth}`postencode_sample <megatron.energon.TaskEncoder.postencode_sample>`. This is especially useful when you use packing and you want to defer loading to when the packed samples are known.
+- {py:meth}`def preencode_sample(self, sample: T_sample) -> T_sample <megatron.energon.TaskEncoder.preencode_sample>`
+  - Can be used as a replacement for {py:func}`encode_sample <megatron.energon.DefaultTaskEncoder.encode_sample>`.
+  - Only used together with {py:meth}`postencode_sample <megatron.energon.TaskEncoder.postencode_sample>` (below). Use it if you need to instantiate lazy data before packing the selected samples. It also works without packing.
+- {py:meth}`def select_samples_to_pack(self, samples: List[T_encoded_sample]) -> List[List[T_encoded_sample]] <megatron.energon.TaskEncoder.select_samples_to_pack>`
+  - Optional. Allows for efficient sample packing. See [](../advanced/packing).
+- {py:meth}`def postencode_sample(self, sample: T_sample) -> T_encoded_sample <megatron.energon.TaskEncoder.postencode_sample>`
+  - Only used together with {py:meth}`preencode_sample <megatron.energon.TaskEncoder.preencode_sample>`. Use it if you need to instantiate lazy data before packing the selected samples. It also works without packing.
+- {py:meth}`def pack_selected_samples(self, samples: List[T_encoded_sample]) -> T_batch_sample <megatron.energon.TaskEncoder.pack_selected_samples>`
+  - Required if `select_samples_to_pack` is used. Compresses a group of samples to a single sample.
+- (samples are collected for a batch)
+- {py:meth}`def batch(self, batch: List[T_encoded_sample]) -> T_raw_batch <megatron.energon.DefaultTaskEncoder.batch>`
+  - Collate the batch to a single sample, defaults to padded batching for tensors, lists for everything else.
+- {py:meth}`def encode_batch(self, batch_data: T_raw_batch) -> T_batch <megatron.energon.DefaultTaskEncoder.encode_batch>`
+  - Transform the batched data (e.g. tokenize the whole batch).
+- (optionally limit the dataset size, based on the `limit` argument)
+- (optionally epochize the dataset)
+- (move data from the worker to the main process through the {py:class}`torch.utils.data.DataLoader` via {py:func}`get_loader <megatron.energon.get_loader>`)
+- For batches based on {py:class}`Batch <megatron.energon.Batch>`, call {py:meth}`def pin_memory(self, batch: T_batch) -> T_batch <megatron.energon.Batch.pin_memory>`, or if not a dataclass, use default torch pinning (this must happen in the main process, thus after data loading)
+
+If a sample or batch is to be ignored, any of these methods may raise {py:class}`IgnoreSample <megatron.energon.IgnoreSample>` to skip the sample being processed.
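
The order of the main steps listed above can be sketched in plain Python. This is a simplified stand-in, not energon's implementation: `SkipSample` plays the role of `IgnoreSample`, `run_pipeline` is a hypothetical driver, packing is left out, only `encode_sample` may skip a sample, and an incomplete final batch is simply dropped.

```python
from typing import Any, Callable, Iterable, Iterator, List


class SkipSample(Exception):
    """Stand-in for IgnoreSample in this sketch."""


def run_pipeline(
    samples: Iterable[Any],
    encode_sample: Callable[[Any], Any],
    batch: Callable[[List[Any]], Any],
    encode_batch: Callable[[Any], Any],
    batch_size: int,
) -> Iterator[Any]:
    """Apply encode_sample per sample, then batch and encode_batch per batch."""
    pending: List[Any] = []
    for sample in samples:
        try:
            pending.append(encode_sample(sample))
        except SkipSample:
            # The sample is dropped and does not count towards the batch
            continue
        if len(pending) == batch_size:
            yield encode_batch(batch(pending))
            pending = []


def encode_sample(text: str) -> str:
    if not text:
        raise SkipSample()
    return text.upper()


def batch(samples: List[str]) -> List[str]:
    return list(samples)


def encode_batch(raw_batch: List[str]) -> List[int]:
    return [len(text) for text in raw_batch]


batches = list(
    run_pipeline(["a", "", "bcd", "ef", "ghij"], encode_sample, batch, encode_batch, batch_size=2)
)
```

The empty string is skipped, so the two batches are built from the four remaining samples.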
+
+The types `T_sample`, `T_encoded_sample`, `T_raw_batch` and `T_batch` are generics and depend on your task. You do not necessarily have to specify them; they are only used for proper typing in your IDE.
+
+```python
+from dataclasses import dataclass
+from typing import Callable, List, Optional
+
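+import torch
+
+# Assumption: a Hugging Face tokenizer is used, as in the training script below
+from transformers import PreTrainedTokenizerBase as Tokenizer
+
+from megatron.energon import Batch, CaptioningSample, DefaultTaskEncoder, SampleDecoder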
+
+
+# Type for intermediate batch, after batching operation
+@dataclass
+class CaptioningRawBatch(Batch):
+    # (n, c, h, w)
+    image: torch.Tensor
+    # (n,)
+    caption: List[str]
+
+
+# Typing for the resulting batch data
+@dataclass
+class CaptioningBatch(Batch):
+    # (n, c, h, w)
+    images: torch.Tensor
+    # (n, c)
+    text_tokens: torch.Tensor
+    # (n, c, c)
+    text_attn_mask: torch.Tensor
+
+
+# All the typing is optional
+class CaptioningTaskEncoder(
+    DefaultTaskEncoder[CaptioningSample, CaptioningSample, CaptioningRawBatch, CaptioningBatch]
+):
+    """A simple task encoder for captioning."""
+
+    decoder = SampleDecoder(image_decode="torchrgb")
+
+    def __init__(
+        self,
+        tokenizer: Tokenizer,
+        image_transform: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,
+        max_length: int = 128,
+    ):
+        # Specify the batch_type for default batching (batching is performed here "manually" by overwriting the `batch`
+        # method)
+        super().__init__(batch_type=CaptioningRawBatch)
+        self.tokenizer = tokenizer
+        self.image_transform = image_transform
+        self.max_length = max_length
+
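+    def encode_sample(self, sample: CaptioningSample) -> CaptioningSample:
+        # The transform is optional, so only apply it if one was given
+        if self.image_transform is not None:
+            sample.image = self.image_transform(sample.image)
+        return sample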
+    
+    def batch(self, samples: List[CaptioningSample]) -> CaptioningRawBatch:
+        # Batch the samples
+        # By default, `batch_pad_stack` is used for all tensor fields, and `batch_list` is used for all non-tensor 
+        # fields. This example matches the default implementation (not overwriting the `batch` method).
+        return CaptioningRawBatch.from_samples(samples)
+
+    def encode_batch(self, batch_data: CaptioningRawBatch) -> CaptioningBatch:
+        # Run the encoder on the batch of captions.
+        tokenized = self.tokenizer(batch_data.caption)
+        # Return the final batch, going into the network
+        return CaptioningBatch.derive_from(
+            batch_data,
+            images=batch_data.image,
+            text_tokens=tokenized["input_ids"],
+            text_attn_mask=tokenized["attention_mask"],
+        )
+
+```
+
+If you're wondering about the `decoder` assignment, check out [](../basic/data_decoding).
+
+Usage in your training script:
+```python
+    
+from torchvision import transforms
+from transformers import AutoTokenizer
+from megatron.energon import get_loader, get_train_dataset
+
+    
+train_img_transform = transforms.Compose(
+    [
+        transforms.RandomResizedCrop((224, 224)),
+        transforms.RandomHorizontalFlip(),
+    ]
+)
+
+train_loader = get_loader(get_train_dataset(
+    '/my/dataset/path',
+    batch_size=32,
+    shuffle_buffer_size=100,
+    max_samples_per_sequence=100,
+    task_encoder=CaptioningTaskEncoder(
+        tokenizer=AutoTokenizer.from_pretrained('gpt2'),
+        image_transform=train_img_transform,
+    ),
+))
+
+for data in train_loader:
+    # data is a CaptioningBatch
+    pass
+
+```
--- a/Megatron-Energon/docs/source/conf.py
+++ b/Megatron-Energon/docs/source/conf.py
+# Copyright (c) 2025, NVIDIA CORPORATION.
+# SPDX-License-Identifier: BSD-3-Clause
+# -*- coding: utf-8 -*-
+
+# Configuration file for the Sphinx documentation builder.
+#
+# This file does only contain a selection of the most common options. For a
+# full list see the documentation:
+# http://www.sphinx-doc.org/en/master/config
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+
+# src_folder = pathlib.Path(__file__).parents[2]
+# print("Copying README.md to docs/source/development_setup.md")
+# shutil.copyfile(
+#     str(src_folder / ".." / "README.md"), str(src_folder / "docs" / "source" / "development_setup.md")
+# )
+
+# Add path to energon module
+sys.path.insert(0, os.path.abspath("../../src"))
+
+# -- Project information -----------------------------------------------------
+
+project = "megatron-energon"
+copyright = "2025 NVIDIA Corporation"
+author = "Lukas Voegtle, Philipp Fischer"
+
+# The short X.Y version
+version = ""
+# The full version, including alpha/beta/rc tags
+release = ""
+
+# -- General configuration ---------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    "sphinx.ext.autodoc",
+    "sphinx.ext.viewcode",
+    "sphinx.ext.mathjax",
+    "sphinx.ext.napoleon",
+    "myst_parser",  # markdown(*.md) parser
+    "sphinx_click",
+]
+
+# Autodoc
+autodoc_mock_imports = [
+    "braceexpand",
+    "fsspec",
+    # "torch",
+    "webdataset",
+    "tqdm",
+    "numpy",
+    "PIL",
+    "s3fs",
+]
+autodoc_typehints = "description"
+autodoc_typehints_format = "short"
+
+# Mock everything here, so that not just autodoc, but also sphinx_click can make use of the mock imports
+from sphinx.ext.autodoc.mock import MockFinder
+
+sys.meta_path.insert(0, MockFinder(autodoc_mock_imports))
+
+# Napoleon
+napoleon_google_docstring = True
+napoleon_numpy_docstring = True
+napoleon_include_init_with_doc = True
+napoleon_include_private_with_doc = False
+napoleon_include_special_with_doc = False
+napoleon_use_admonition_for_examples = False
+napoleon_use_admonition_for_notes = False
+napoleon_use_admonition_for_references = False
+napoleon_use_ivar = False
+napoleon_use_param = True
+napoleon_use_rtype = True
+napoleon_use_keyword = True
+napoleon_custom_sections = None
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ["_templates"]
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+# source_suffix = ".rst"
+source_suffix = {
+    ".rst": "restructuredtext",
+    ".md": "markdown",
+}
+
+# The master toctree document.
+master_doc = "index"
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = "en"
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = None
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = "sphinx_rtd_theme"
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+html_theme_options = {
+    "prev_next_buttons_location": "both",
+}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ["_static"]
+html_css_files = ["css/custom.css"]
+
+# Favicon configuration
+html_favicon = "_static/favicon.ico"
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# The default sidebars (for documents that don't match any pattern) are
+# defined by theme itself.  Builtin themes are using these templates by
+# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
+# 'searchbox.html']``.
+#
+# html_sidebars = {}
+
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = "energondoc"
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = []
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [(master_doc, "megatron-energon", "Megatron-Energon Documentation", [author], 1)]
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = []
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ["search.html"]
+
+# -- Extension configuration -------------------------------------------------
--- a/Megatron-Energon/docs/source/images/data_flow.drawio
+++ b/Megatron-Energon/docs/source/images/data_flow.drawio
--- a/Megatron-Energon/docs/source/images/data_flow.png
+++ b/Megatron-Energon/docs/source/images/data_flow.png
--- a/Megatron-Energon/docs/source/images/energon3_darkbg_border.png
+++ b/Megatron-Energon/docs/source/images/energon3_darkbg_border.png
--- a/Megatron-Energon/docs/source/images/joining.drawio
+++ b/Megatron-Energon/docs/source/images/joining.drawio
--- a/Megatron-Energon/docs/source/images/joining.png
+++ b/Megatron-Energon/docs/source/images/joining.png
--- a/Megatron-Energon/docs/source/index.md
+++ b/Megatron-Energon/docs/source/index.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Megatron-Energon Documentation
+
+This is the documentation of Megatron's multi-modal data loader "Energon".
+
+We recommend getting started in the [Introduction](intro/introduction) section, which explains what Energon is and how to install it.
+
+Once installed, check out the **Basic Usage** section starting with [Quickstart](basic/quickstart) for some basic examples and tutorials.
+Some underlying concepts will be explained in the rest of that section.
+
+For specific use cases and advanced usage, please read **Advanced Usage**.
+
+At the end, you will also find some documentation on how to interface with energon programmatically and how to contribute to the code base.
+
+```{toctree}
+---
+caption: Introduction
+maxdepth: 2
+---
+
+intro/introduction
+intro/installation
+```
+
+
+```{toctree}
+---
+caption: Basic Usage
+maxdepth: 2
+---
+basic/quickstart
+basic/data_prep
+basic/data_decoding
+basic/basics_flow
+basic/task_encoder
+basic/metadataset
+basic/save_restore
+basic/glossary
+```
+
+
+```{toctree}
+---
+caption: Advanced Usage
+maxdepth: 2
+---
+advanced/remote_dataset
+advanced/crude_datasets
+advanced/custom_sample_loader
+advanced/repro_scaling
+advanced/packing
+advanced/grouping
+advanced/joining_datasets
+advanced/subsets
+advanced/epochized_blending
+advanced/custom_blending
+advanced/parallelism
+```
+
+
+```{toctree}
+---
+caption: API
+maxdepth: 2
+---
+api/modules
+api/cli
+```
+
+
+```{toctree}
+---
+caption: Internals
+maxdepth: 2
+---
+internals/contrib_guidelines
+internals/code_structure
+```
+
+# Indices and tables
+
+- [](genindex)
+- [](modindex)
--- a/Megatron-Energon/docs/source/internals/code_structure.md
+++ b/Megatron-Energon/docs/source/internals/code_structure.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Code Structure
+
+This section is meant to provide an introduction to Megatron Energon for developers who want to contribute to energon itself.
+
+For now, this is still a placeholder and we encourage you to get in touch with us for an introduction.
--- a/Megatron-Energon/docs/source/internals/contrib_guidelines.md
+++ b/Megatron-Energon/docs/source/internals/contrib_guidelines.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Contribution Guidelines
+
+If you want to contribute to this repository, please adhere to the following guidelines:
+
+- Always use [black](https://pypi.org/project/black/) and [isort](https://pycqa.github.io/isort/) to format your code before committing
+- Check that all license headers are present using `python3 scripts/license_headers.py --fix .`
+- Python `@dataclass` and `NamedTuple` are preferred over dictionaries, which don't allow for IDE
+  auto-completion and type checking
+- User-exposed classes and methods should be documented in Google-style docstrings that are parsed by sphinx
+  and end up in this documentation
+- Breaking changes should be marked in the message of pull requests:
+  - `CHECKPOINT BREAKING CHANGE`: When the save/restore structure changed incompatibly (check test `test_metadataset:TestDataset.test_save_restore_state_train`)
+  - `ITERATION ORDER BREAKING CHANGE`: When the order of iterating samples changed, i.e. experiments would not be exactly reproducible (check tests `test_dataset:TestDataset.test_current_batch_index_generator`, `test_dataset:TestDataset.test_current_batch_index`, maybe more)
+  - `API BREAKING CHANGE`: When the external programming api changed incompatibly
+  - `DATASET CONFIG BREAKING CHANGE`: When the dataset config (`.nv-meta` folder) changed incompatibly
+  - `METADATASET CONFIG BREAKING CHANGE`: When the metadataset config changed
+- In a release, all breaking changes except checkpoint lead to a new major version.
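
As an illustration of the guidelines on `@dataclass`, `NamedTuple` and Google-style docstrings, here is a minimal sketch. The names `CropConfig`, `ImageSize` and `output_size` are made up for this example and are not part of energon.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass
class CropConfig:
    """Settings for cropping an image (hypothetical example class).

    Attributes:
        height: Height of the crop in pixels.
        width: Width of the crop in pixels.
        random: If True, the crop position is chosen randomly.
    """

    height: int
    width: int
    random: bool = False


class ImageSize(NamedTuple):
    """Height and width of an image in pixels."""

    height: int
    width: int


def output_size(config: CropConfig) -> ImageSize:
    """Compute the size of the image after cropping.

    Args:
        config: The crop settings.

    Returns:
        The height and width of the cropped image.
    """
    return ImageSize(height=config.height, width=config.width)


size = output_size(CropConfig(height=224, width=224))
```

Compared to passing `{"height": 224, "width": 224}` around, the field names here are known to the IDE and to type checkers, and a misspelled field fails immediately.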
--- a/Megatron-Energon/docs/source/intro/installation.md
+++ b/Megatron-Energon/docs/source/intro/installation.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# Installation
+
+If you simply want to use this package without modifying it, the best option is to install it 
+as a dependency of your project like you would with any other pip package.
+
+## Normal Installation
+
+To install the most recent release version, run
+
+```shell
+pip install megatron-energon
+```
+
+in your project's Python environment, which could be a virtualenv, or a conda environment.
+You can even install it inside a `Dockerfile` to include it in your custom docker container.
+
+If you want to use [remote datasets](../advanced/remote_dataset) or [audio/video decoding](av-decoder), you
+need to provide *extras* to the installation command, for example:
+
+```shell
+pip install megatron-energon[s3,av_decode]
+```
+
+For all available extras, check out the above links and the `pyproject.toml` file.
+
+## Installation for Development
+
+If you want to manage, debug or modify the code of energon itself, we recommend that you clone this repository
+on your disk.
+
+You can then install the package in **editable** mode.
+This way, you can use energon and its CLI scripts while still being able to modify the source code.
+
+First, check out the repository locally:
+```shell
+git clone https://github.com/NVIDIA/Megatron-Energon.git megatron-energon
+```
+
+Then install with your favorite tooling:
+
+### Editable installation with uv and just
+
+* `uv` is a fast modern tool that can replace legacy tools like pip, conda and virtualenv.
+* `just` is a command runner that simplifies common tasks using the `justfile` we provide.
+
+Check out the [official website](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer) on how to install `uv`.
+On [this page](https://github.com/casey/just?tab=readme-ov-file#packages) you can find out how to install `just`.
+
+
+Then, to set up a `.venv` and install energon in editable mode:
+```shell
+cd megatron-energon
+just dev-sync
+```
+
+The `dev-sync` command will set up a local virtual environment in `.venv` and install all dependencies.
+It will also install energon in editable mode for development inside that venv.
+
+Activate the environment:
+```shell
+. .venv/bin/activate
+```
+
+Now you can call the `energon` command.
+
+You can also use `just` to do a bunch of other things shown below.
+Note that you don't need to activate the venv before running those.
+
+```shell
+# Run all unit tests
+just test
+
+# Run the code linter and format check
+just check
+
+# Build the documentation
+just docs
+
+# Show all available commands
+just help
+```
+
+### Editable installation with pip
+
+First, make sure you are in the Python environment where you want to set up energon.
+Then install in development mode:
+```shell
+pip install -e ./megatron-energon
+```
+
+```{warning}
+**We discourage importing the cloned repo without pip install** 
+- You will not be able to use the command line tool
+- You would have to use hacks to get the package into your `PYTHONPATH`
+- You would need to take care of the dependencies yourself. 
+
+Instead, simply install in development mode.
+```
--- a/Megatron-Energon/docs/source/intro/introduction.md
+++ b/Megatron-Energon/docs/source/intro/introduction.md
+<!--- Copyright (c) 2025, NVIDIA CORPORATION.
+SPDX-License-Identifier: BSD-3-Clause -->
+
+# General
+
+Megatron-Energon is a data loader that works best with your [Megatron](https://github.com/NVIDIA/Megatron-LM) project.
+However, you can use it in any of your PyTorch-based deep learning projects.
+
+What can it offer compared to other data loaders?
+
+The most important features are:
+
+* Comes with a standardized WebDataset-based format on disk
+* Optimized for high-speed multi-rank training
+* Can handle very large datasets
+* Can easily mix and blend multiple datasets
+* Its state is savable and restorable (deterministic resumability)
+* Handles various kinds of multi-modal data even in one training run
+
+Energon also comes with a command line tool that you can use to prepare your datasets.
--- a/Megatron-Energon/justfile
+++ b/Megatron-Energon/justfile
+# https://github.com/casey/just
+
+# List justfile recipes
+help:
+    just --list
+
+# Update the environment with the latest version of the dependencies
+dev-sync:
+    uv sync --all-extras --cache-dir .uv_cache
+
+# Update the environment but not with the development dependencies
+prod-sync:
+    uv sync --all-extras --no-dev --cache-dir .uv_cache
+
+# Fix the code style and format
+fix: dev-sync
+    uv run ruff check --fix
+    uv run ruff format
+    uv run scripts/license_headers.py src --fix
+    uv run scripts/license_headers.py tests --fix
+
+# Execute the ruff code linter and format checker
+check: dev-sync
+    uv run ruff check
+
+# Execute all unit tests
+test: dev-sync
+    uv run -m unittest discover -v -s tests
+
+# Build the docs
+docs: dev-sync
+    uv run sphinx-build -b html docs/source docs/build
+
+# Build the release package
+build: dev-sync
+    rm -rf dist
+    uv build --wheel
+    uv build --sdist