<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Data Flow
![energon data flow](../images/data_flow.png)
How the data flows through the task encoder methods is explained in [](task_encoder).
(flavors_general)=
## Dataset Flavors
The datasets are organized in "flavors", i.e. each modality returned by the dataset is a "flavor".
A modality can, for example, be a {py:class}`CaptioningSample <megatron.energon.CaptioningSample>` or a
{py:class}`VQASample <megatron.energon.VQASample>`. The dataset class combines the source data format
and the iterated sample format. For example, the {py:class}`CaptioningWebdataset <megatron.energon.CaptioningWebdataset>`
combines the webdataset loader with the {py:class}`CaptioningSample <megatron.energon.CaptioningSample>`.
For all types, see [](sect-sample-types).
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Data Decoding
When iterating your dataset, the first processing step that each sample goes through, even before [](sample-loading), is the decode step.
As explained [here](wds-format), each sample in the underlying `.tar` file can have multiple entries with different extensions.
Just like WebDataset itself, energon uses the [auto decode](https://rom1504.github.io/webdataset/api/webdataset/autodecode.html) functionality of WebDataset to transform those raw bytes from the tar file entries
to usable objects. Images, for example, are decoded to arrays or tensors of pixel data.
In Energon, the class responsible for data decoding is {py:class}`DefaultDecoderWebdatasetFactory <megatron.energon.DefaultDecoderWebdatasetFactory>` (which you usually don't need to touch directly). Its `__init__` method contains a piece of code that initializes the auto decoder and tells it which data types shall be decoded and how.
Decoders typically convert
* Text to strings (instead of bytes)
* JSON to dictionaries and lists
* Images, videos and audios to pixel tensors or audio sample tensors
  * For other options see below
* Torch PTH files to tensors
* NPY files to numpy arrays
## How to control data decoding
Starting with Energon 7.0.0, the new way to configure data decoding is to specify a `decoder` as a class variable of your [Task Encoder](../basic/task_encoder):
```python
from megatron.energon import DefaultTaskEncoder, SampleDecoder

class MyTaskEncoder(DefaultTaskEncoder):
    decoder = SampleDecoder(image_decode="pilrgb")
```
Typically, you will just instantiate a {py:class}`SampleDecoder <megatron.energon.SampleDecoder>` and provide the arguments to configure it, as shown above.
If you do **not** want automatic decoding at all, you have to explicitly set `decoder = None` in your TaskEncoder.
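For example, to receive raw bytes in your sample loader or cooker (a minimal sketch):
```python
from megatron.energon import DefaultTaskEncoder

class MyRawTaskEncoder(DefaultTaskEncoder):
    # Disable automatic decoding entirely; sample parts arrive as raw bytes
    decoder = None
```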
Here are the different options you can pass to {py:class}`SampleDecoder <megatron.energon.SampleDecoder>` (a combined example follows the list):
* `image_decode` (str)
  * Can be set to an image decoder from webdataset. Here are some examples:
    * `pil`: Returns the image as a PIL image
    * `torchrgb`: Returns the image as a torch tensor with 3 color channels
    * For more options, check out the [official documentation](https://rom1504.github.io/webdataset/api/webdataset/autodecode.html#ImageHandler).
* `av_decode` (str)
  * Can be one of `AVDecoder`, `torch`, `pyav`. The default is `AVDecoder`, which is explained [below](av-decoder).
  * The option `torch` decodes video and audio entirely and returns them as tensors.
  * The `pyav` option is for advanced use cases where you need direct access to the object returned by `av.open()`.
* `video_decode_audio` (bool)
  * If `True`, videos that have an audio track will decode both the video and the audio. Otherwise, only the video frames are decoded.
* `guess_content` (bool)
  * *New in Energon 7.0.0*
  * Whether to guess the contents of the file using the `filetype` package. Useful if you have files without extensions in your data.
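Putting these options together (a sketch; the chosen values are just illustrative):
```python
from megatron.energon import DefaultTaskEncoder, SampleDecoder

class MyAVTaskEncoder(DefaultTaskEncoder):
    decoder = SampleDecoder(
        image_decode="torchrgb",  # images as torch tensors with 3 color channels
        av_decode="AVDecoder",    # lazy audio/video decoding (the default)
        video_decode_audio=True,  # also decode the audio track of videos
        guess_content=False,      # don't guess types of extension-less files
    )
```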
### Legacy method before Energon 7.0.0
```{warning}
The below method of configuring auto decoding was deprecated in Energon 7.0.0. Please migrate to the new method with `SampleDecoder` above.
```
In older versions of Energon, you could pass arguments when calling {py:func}`get_train_dataset <megatron.energon.get_train_dataset>` or {py:func}`get_val_dataset <megatron.energon.get_val_dataset>`.
The arguments are more or less identical to what can be passed to `SampleDecoder` above (a sketch follows the list), except:
* `auto_decode` (bool)
  * Set to `False` to disable all automatic decoding of the data. In your sample loader, you will then get raw bytes. The default setting is `True`.
  * Setting it to `False` is equivalent to setting `decoder = None` in the new version.
* `guess_content` (bool)
  * Not available in older versions
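As a sketch, a legacy call might have looked like this (keyword names as listed above; the combination is illustrative):
```python
from megatron.energon import WorkerConfig, get_train_dataset

ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    auto_decode=False,  # legacy equivalent of `decoder = None`
)
```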
(av-decoder)=
## AV Decoder (Audio and Video)
Energon comes with code to efficiently decode compressed video and audio files such as MP4, MP3 or WAV.
It integrates a library called `fastseek` that allows quickly extracting sub-clips from longer videos.
The actual decoding is done by PyAV, which in turn relies on ffmpeg.
When choosing `AVDecoder` as the option for `av_decode` above, an object of the type {py:class}`AVDecoder <megatron.energon.av.AVDecoder>` (click to see methods) will be returned.
At this point, the file has not yet been decoded, but in your [custom sample loader](custom-sample-loader) or in your [TaskEncoder](../basic/task_encoder), you can read parts or all of the file using the methods of `AVDecoder`.
For example, to extract the first 2 seconds of a video and the 2 seconds starting at the 4-second mark:
```py
dat = av_decoder.get_clips(
    video_clip_ranges=[(0, 2), (4, 6)],
    video_unit="seconds",
)
# Print both clips' tensor shapes
print(dat.video_clips[0].shape)
print(dat.video_clips[1].shape)
```
The return value of {py:meth}`get_clips <megatron.energon.av.AVDecoder.get_clips>` will be an object of type {py:class}`AVData <megatron.energon.av.AVData>`. It may contain video or audio tensors or both, depending on what you requested. It will also contain the timestamps of the actually returned clips.
Click on {py:class}`AVData <megatron.energon.av.AVData>` to see the fields and the shapes of the tensors.
You can also use helper methods to extract multiple clips at equal distances all at once:
```py
from megatron.energon.av import get_clips_uniform

dat = get_clips_uniform(
    av_decoder=av_decoder, num_clips=5, clip_duration_seconds=3, request_audio=True
)
```
This would extract 5 clips, each 3 seconds long, along with the corresponding audio clips.
The first clip will start at position 0 and the last clip will end at the end of the video.
This is essentially just a convenience wrapper around `get_clips`, so its code may be a good starting point if you want to write custom extraction logic.
We also provide a method {py:func}`get_single_frames_uniform <megatron.energon.av.get_single_frames_uniform>` which will return a tensor of frames directly instead of an `AVData` object.
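A minimal sketch of its usage (the `num_frames` parameter is an assumption, analogous to `num_clips` above):
```python
from megatron.energon.av import get_single_frames_uniform

# Extract 8 evenly spaced frames across the whole video
frames = get_single_frames_uniform(av_decoder=av_decoder, num_frames=8)
print(frames.shape)  # a tensor of frames, not an AVData object
```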
The simplest case is to decode the whole video or audio or both:
```py
dat = av_decoder.get_video()
# or
dat = av_decoder.get_audio()
# or
dat = av_decoder.get_video_with_audio()
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
(data-preparation)=
# Data Preparation
The aim of data preparation is to convert your data to a format that the energon loader can understand and iterate.
Energon's primary data format is [WebDataset](https://github.com/webdataset/webdataset) with some extra information stored in a folder called `.nv-meta`.
Below in [](data-on-disk) we explain the details about this format.
We also support a simpler JSONL format which will always be interpreted as [crude data](crude-data).
## Important Considerations
Depending on what your data looks like and how you are planning to use it, you will have to make a few choices,
**before you prepare your dataset**:
**Monolithic Dataset vs. Polylithic (primary and auxiliary) Datasets**
You can include the media (images/video/audio) inside the same webdataset along with the text and metadata of each sample.
Or you can keep the media separate (either in another indexed webdataset or as individual files on disk).
When using JSONL, the media will always be separate, so JSONL datasets are always polylithic unless they are text-only.
If you can, you should go for the monolithic option, because it's faster to load.
However, there are a few reasons why the other option may be needed:
* You need to keep the original media and you don't want to duplicate it
* Your media data is very large (e.g. long videos) and you need to keep your primary dataset small (containing just the text-based data and meta information)
* You want to re-use the same media with different labels or you want to train on different subsets
* You want to train with [online packing](../advanced/packing.md) and can't fit all the media of the packing buffer in memory. With polylithic datasets you can use caching to avoid that issue.
**How to shard the data**
When using a WebDataset, it will be split into a number of shards (i.e. tar files). You'll have to decide how many samples to put in one shard and how many shards to have overall.
To maximize the loading speed, use as few shards as possible. Even a single shard can work well!
However, if you cannot handle files above a certain size you may need to split the shards more.
A good rule of thumb is to keep your **number of shards below 10k**.
If you are using remote filesystems like S3, there may be an opposing constraint: S3 [limits the number of requests per second](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html)
that you can make for a single prefix (e.g. filename). By using more shards, you can increase the overall rate. Ideally, you would still want to stay below 10k shards.
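As a rough sketch, you can derive the `maxcount` for the `ShardWriter` (see below) from your dataset size to stay below that shard limit (the numbers are illustrative):
```python
# Assumption: ~50M samples overall, target at most 10k shards
num_samples = 50_000_000
max_shards = 10_000
maxcount = -(-num_samples // max_shards)  # ceil division -> 5000 samples per shard
```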
**Raw vs. baked data**
When using images for example, you could put either the encoded JPG, the decoded pixel values or even the encoded features into the dataset.
Typically, we recommend going with the "original form" (e.g. JPG) and doing all the processing on the fly inside the [cooker](crude-data) and [task encoder](../basic/task_encoder).
This way, you can change the processing and keep your dataset.
However, if the processing becomes a bottleneck, you can move some of it into the dataset creation phase by baking the information in.
Keep in mind that others may also want to use your dataset for a different project.
(monolithic-dataset)=
## Steps to Create a Monolithic Dataset
These are the typical steps to get your data ready:
1. Create a normal [WebDataset](https://github.com/webdataset/webdataset) from your data (including all the media content)
2. Run our preparation tool [`energon prepare`](energon-prepare) to create the additional metadata needed by energon. See [](data-on-disk).
(polylithic-dataset)=
## Steps to Create a Polylithic Dataset
1. Create the primary [WebDataset](https://github.com/webdataset/webdataset) or JSONL file from the text-based part of your data (meta information, labels, sizes etc.)
   * Include the file names (not absolute paths) of the media that belongs to each sample, e.g. as strings inside a JSON entry (see the sketch after this list)
2. Create the auxiliary dataset(s). These can be multiple datasets, e.g. one per modality.
   * Either as a folder on disk with all the media files inside
   * Or as another WebDataset that contains just the media files (with the exact same names)
3. Run our preparation tool `energon prepare` **on both datasets** (yes, also on the JSONL) to convert them to an energon-compatible format
   * Configure both datasets as `CrudeWebdataset` (JSONL always is by default)
4. Create a [metadataset](../basic/metadataset) that specifies what auxiliary data to load for each primary dataset
   * For more details read about [crude data](crude-data)
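Here is a sketch of step 1 for a WebDataset-based primary dataset, referencing the media by file name (the key and field names are illustrative):
```python
import webdataset as wds

with wds.ShardWriter("primary/data-%d.tar", maxcount=10000) as writer:
    writer.write({
        "__key__": "sample_0000",
        "json": {
            "caption": "A desktop computer with two monitors.",
            # File name of the media inside the auxiliary dataset, not an absolute path
            "image": "computer_01.jpg",
        },
    })
```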
(create-jsonl-dataset)=
## Steps to Create a JSONL Dataset
A JSONL dataset is a simplified alternative to a full-blown WebDataset with tar files.
It has fewer features, but can easily be read using a standard editor.
```{admonition} Good to know
:class: tip
A JSONL dataset cannot contain media files, but it can reference media files elsewhere (auxiliary data).
It does not have a train/val/test split.
It cannot be used as an auxiliary dataset by other primary datasets.
It cannot be mounted using `energon mount`.
```
A single JSONL file will contain all of your text-based data, one JSON entry per line. For example:
```
{"id": 0, "question": "What is 1+2?", "answer": "3"}
{"id": 1, "question": "Who is Jensen Huang?", "answer": "The CEO of NVIDIA."}
```
And it is essentially equivalent to using a WebDataset with files
```
0.json
1.json
```
each file containing the JSON from one of the lines above.
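Such a file can be written with a few lines of Python (a minimal sketch):
```python
import json

rows = [
    {"id": 0, "question": "What is 1+2?", "answer": "3"},
    {"id": 1, "question": "Who is Jensen Huang?", "answer": "The CEO of NVIDIA."},
]
with open("my_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```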
None of the JSON fields is mandatory. The data is considered to be crude data and will be interpreted by your custom [cooker](crude-data).
If you want to include media, you should include file names of the media files in the JSON.
A metadataset with [auxiliary data](aux-data) can then be used to load the media on the fly.
Here's an example of how a polylithic JSONL dataset with images might look:
```
{"image": "computer_01.jpg", "caption": "A desktop computer with two monitors."}
{"image": "mountains_123.jpg", "caption": "A beautiful landscape with mountains on a sunny day."}
```
Steps needed:
1. Create the JSONL file according to your needs
2. Run `energon prepare /path/to/my_dataset.jsonl` to create an index next to it
3. Optionally create a [metadataset](../basic/metadataset) that specifies what auxiliary data to load for each primary dataset
   * For more details read about [crude data](crude-data)
The metadataset would then refer to the JSONL dataset while specifying the auxiliary data source:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: /path/to/my_dataset.jsonl
    aux:
      foo_bar_source: ./aux_ds123
      image_source: filesystem://./relative_image_folder
```
An auxiliary data source can be a local or remote folder, or another energon-prepared webdataset. Multiple auxiliary sources can even be used.
For all the options and to see how to specify a matching cooker, please check out the section on [auxiliary data](aux-data).
(wds-format)=
## Step 1: Creating a WebDataset
Example for a WebDataset (e.g. image captioning dataset):
```
shards
├── shard_0000.tar
│   ├── sample_0000.jpg
│   ├── sample_0000.txt
│   ├── sample_0000.detail.json
│   ├── sample_0001.jpg
│   ├── sample_0001.txt
│   └── sample_0001.detail.json
├── shard_0001.tar
│   ├── sample_0002.jpg
│   ├── sample_0002.txt
│   ├── sample_0002.detail.json
│   ├── sample_0003.jpg
│   ├── sample_0003.txt
│   └── sample_0003.detail.json
└── ...
```
In the example you can see two shards (i.e. tar files) with multiple samples. Each group of files with the same basename makes one sample.
So `sample_0000.jpg`, `sample_0000.txt` and `sample_0000.detail.json` are three parts that belong to the first sample.
This shows a monolithic dataset; for a polylithic dataset you would drop the JPGs from the primary dataset.
Note that each sample may have a different number of parts, for example some samples may have more images than others.
In this case, they should still have the same basename, for example `sample_0000.img1.jpg` and `sample_0000.img2.jpg`. For an advanced example for interleaved data, check out [this section](interleaved-sample-loader).
The order of samples in the tar file is important: samples with the same base name (roughly, the part before the first dot of the filename) must follow each other.
The base name is used to group the samples, i.e. in the example `sample_0000` is the first group name, with the part types `jpg`, `txt`, `detail.json`.
The default behavior of energon is to parse the contents by extensions (e.g. ending on `.json` will automatically use `json.loads`, `.png` will load the image).
### Building a WebDataset using Python
The easiest way to construct a WebDataset from existing data (e.g. from another torch dataset or a folder with files) is to use the ShardWriter from the webdataset library:
```py
import webdataset as wds

if __name__ == '__main__':
    # Wherever your dataset comes from
    my_dataset = ...

    with wds.ShardWriter("parts/data-%d.tar", maxcount=10000) as shard_writer:
        for key, data in my_dataset:
            sample = {
                "__key__": key,
                "png": data['image'],
            }
            shard_writer.write(sample)
```
## Step 2: Preparing the Dataset
Once you have a WebDataset ready, you will want to prepare it for use with Energon.
This means adding additional metadata files next to the data.
This step does *not* change or copy the contents of your tar files.
Just run the `energon prepare /path/to/dataset` command, which will interactively walk you through the process.
The command will
* Search for all `*.tar` files in the given folder
* Index them so samples can be accessed randomly
* Ask you how you want to split the data into train/val/test partitions
* Ask you about the sample type (optionally crude)
* Ask you how to decode the data if not using crude data (field map or sample_loader.py)
* Store all this information in a subfolder `.nv-meta/`, see details [below](data-on-disk).
### Splitting the dataset into train/val/test
The first thing that the `energon prepare` assistant will ask you is how you want to split the data by ratios.
However, if you have a pre-determined split, you can also pass that to energon. See the examples below.
#### Example 1: Let energon do the split
```text
shards
├── shard_0000.tar
├── shard_0001.tar
└── ...
```
Commandline:
```
> energon prepare ./
# Exemplary answers to interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
image: jpg
caption: txt # if txt contains the caption
# or
caption: json[caption] # if .json contains {"caption": "My nice image"}
```
#### Example 2: Presplit shards by prefix
```text
shards
├── train_shard_0000.tar
├── train_shard_0001.tar
├── ...
├── val_shard_0000.tar
├── val_shard_0001.tar
└── ...
```
Commandline:
```
> energon prepare --split-parts 'train:shards/train_.*' --split-parts 'val:shards/val_.*' ./
```
Note that the pattern matching syntax uses regexes, so for arbitrary characters insert `.*`, not just `*`.
#### Example 3: Presplit shards by folder
```text
shards
├── train
│   ├── shard_00001.tar
│   ├── shard_00002.tar
│   └── ...
├── val
│   ├── shard_00001.tar
│   ├── shard_00002.tar
│   └── ...
└── ...
```
Commandline:
```
> energon prepare --split-parts 'train:shards/train/.*' --split-parts 'val:shards/val/.*' ./
```
```{admonition} Good to know
:class: tip
You can inspect your prepared dataset like a normal file system by using the [`energon mount`](energon-mount) feature.
```
### Sample Types
After the split is set up, the assistant will ask you which sample type you want to use.
We provide a set of common sample types, e.g. for image captioning or visual question answering; they are listed below.
These will be sufficient in simple scenarios, and if none of them fits, you can create your own sample type.
Here are your options:
* Your new sample type is rather common but not in our list below
  * Please add your type to energon and create a pull request so we can add it
* Your sample type is experimental, very special, or only used temporarily
  * You can add the sample type class in your code repository and create the `dataset.yaml` manually, referring to your class with `__class__`
  * You can add the sample type class in your code repository, use a crude dataset and cookers (no need to put the sample type in `dataset.yaml`)
(sect-sample-types)=
#### Available Sample Types
These are the possible integrated types you can currently choose from:
* {py:class}`Sample <megatron.energon.Sample>`: Base dataclass for samples from source webdatasets.
  * Attributes:
    * {py:attr}`__key__: str <megatron.energon.Sample.__key__>`: Unique identifier of the sample within the dataset. Useful for backtracking the source of a single sample.
    * {py:attr}`__restore_key__: str <megatron.energon.Sample.__restore_key__>`: Structured key of the sample, which can be used to regenerate the sample without storing the whole sample.
    * {py:attr}`__subflavors__: dict[str, Any] | None <megatron.energon.Sample.__subflavors__>`: Represents the subflavors (i.e. custom dict data) set for the source dataset (typically in the metadataset).
* {py:class}`CaptioningSample <megatron.energon.CaptioningSample>`: Represents a sample for captioning.
  * Attributes:
    * {py:attr}`image: torch.Tensor <megatron.energon.CaptioningSample.image>`: The input image tensor
    * {py:attr}`caption: str <megatron.energon.CaptioningSample.caption>`: The target caption string
* {py:class}`ImageSample <megatron.energon.ImageSample>`: Represents a sample which only contains an image (e.g. for reconstruction).
  * Attributes:
    * {py:attr}`image: torch.Tensor <megatron.energon.ImageSample.image>`: The image tensor
* {py:class}`ImageClassificationSample <megatron.energon.ImageClassificationSample>`: Represents a sample which contains an image with a class label.
  * Attributes:
    * {py:attr}`image: torch.Tensor <megatron.energon.ImageClassificationSample.image>`: The image tensor
    * {py:attr}`label: int | None <megatron.energon.ImageClassificationSample.label>`: The label of the sample, as an integer
    * {py:attr}`label_name: str | None <megatron.energon.ImageClassificationSample.label_name>`: The label of the sample, as a string
* {py:class}`InterleavedSample <megatron.energon.InterleavedSample>`: Represents a sample which contains interleaved media, such as image and text.
  * Attributes:
    * {py:attr}`sequence: list[torch.Tensor | str] <megatron.energon.InterleavedSample.sequence>`: The interleaved media (either a torch.Tensor or a string for text)
* {py:class}`MultiChoiceVQASample <megatron.energon.MultiChoiceVQASample>`: Represents a sample for visual question answering, with a choice of answers and one correct answer.
  * Attributes:
    * {py:attr}`image: torch.Tensor <megatron.energon.MultiChoiceVQASample.image>`: The input image tensor
    * {py:attr}`context: str <megatron.energon.MultiChoiceVQASample.context>`: The context/question for the image
    * {py:attr}`choices: List[str] | None <megatron.energon.MultiChoiceVQASample.choices>`: The candidate answers
    * {py:attr}`correct_choice_idx: int | None <megatron.energon.MultiChoiceVQASample.correct_choice_idx>`: The index of the correct answer
* {py:class}`OCRSample <megatron.energon.OCRSample>`: Sample type for optical character recognition.
  * Attributes:
    * {py:attr}`image: str <megatron.energon.OCRSample.image>`: The input image
    * {py:attr}`text: str <megatron.energon.OCRSample.text>`: The text string for the whole image
    * {py:attr}`block_boxes: torch.Tensor | None <megatron.energon.OCRSample.block_boxes>`: The bounding boxes of the blocks in the image, shape `float(N, 4|5)` as `<x, y, w, h[, confidence]>`
    * {py:attr}`block_classes: torch.Tensor | list[str] | None <megatron.energon.OCRSample.block_classes>`: The classes of the blocks
    * {py:attr}`block_text: torch.Tensor | None <megatron.energon.OCRSample.block_text>`: The text content of the blocks
    * {py:attr}`lines_boxes: torch.Tensor | None <megatron.energon.OCRSample.lines_boxes>`: The bounding boxes of the text lines
    * {py:attr}`lines_text: list[str] | None <megatron.energon.OCRSample.lines_text>`: The text content of the text lines
    * {py:attr}`words_boxes: torch.Tensor | None <megatron.energon.OCRSample.words_boxes>`: The bounding boxes of the text words
    * {py:attr}`words_text: list[str] | None <megatron.energon.OCRSample.words_text>`: The text content of the text words
    * {py:attr}`chars_boxes: torch.Tensor | None <megatron.energon.OCRSample.chars_boxes>`: The bounding boxes of the text characters
    * {py:attr}`chars_text: list[str] | None <megatron.energon.OCRSample.chars_text>`: The text content of the text characters
* {py:class}`TextSample <megatron.energon.TextSample>`: Represents a sample which only contains a text string (e.g. for text generation).
  * Attributes:
    * {py:attr}`text: str <megatron.energon.TextSample.text>`: The text string
* {py:class}`VidQASample <megatron.energon.VidQASample>`: Represents a sample which contains a video and a question with answer.
  * Attributes:
    * {py:attr}`video: VideoData <megatron.energon.VidQASample.video>`: The input video data
    * {py:attr}`context: str <megatron.energon.VidQASample.context>`: The context/question
    * {py:attr}`answers: list[str] | None <megatron.energon.VidQASample.answers>`: The answer strings
    * {py:attr}`answer_weights: torch.Tensor | None <megatron.energon.VidQASample.answer_weights>`: Weights for possibly multiple answers
* {py:class}`VQASample <megatron.energon.VQASample>`: Represents a sample which contains an image, a question/context and an answer.
  * Attributes:
    * {py:attr}`image: torch.Tensor <megatron.energon.VQASample.image>`: The input image tensor
    * {py:attr}`context: str <megatron.energon.VQASample.context>`: The context/question
    * {py:attr}`answers: list[str] | None <megatron.energon.VQASample.answers>`: The answer strings
    * {py:attr}`answer_weights: torch.Tensor | None <megatron.energon.VQASample.answer_weights>`: Weights for possibly multiple answers
* {py:class}`VQAOCRSample <megatron.energon.VQAOCRSample>`: Sample type for question answering related to optical character recognition.
  * Attributes:
    * {py:attr}`image: str <megatron.energon.VQAOCRSample.image>`: The input image
    * {py:attr}`context: str <megatron.energon.VQAOCRSample.context>`: The context/question
    * {py:attr}`text: str <megatron.energon.VQAOCRSample.text>`: The text contained in the image
    * {py:attr}`answers: list[str] | None <megatron.energon.VQAOCRSample.answers>`: The answer strings
    * {py:attr}`answer_weights: torch.Tensor | None <megatron.energon.VQAOCRSample.answer_weights>`: Weights for possibly multiple answers
    * {py:attr}`words_boxes: torch.Tensor | None <megatron.energon.VQAOCRSample.words_boxes>`: The bounding boxes of the text words
    * {py:attr}`words_text: list[str] | None <megatron.energon.VQAOCRSample.words_text>`: The text content of the text words
(sample-loading)=
### Sample Loading
When you actually use and load your dataset, the data stored in the tar files needs to be converted to an instance of your chosen sample type.
There are three options:
1. The conversion is a simple 1:1 mapping of files to fields of the sample type class
   * In this case, you can use a simple field map
2. Otherwise, the now preferred way is to use a CrudeWebdataset and do the conversion inside a [cooker](crude-data).
3. There is another (now legacy) way: create a custom `sample_loader.py` file next to your dataset.
   * This option will continue to work, but we encourage you to move to crude datasets in the future.
When running `energon prepare`, you can choose "Crude sample" as the sample type and the assistant will end.
If you picked another sample type, the assistant will ask if you want to use a "simple field map" or a "sample loader".
#### Simple Field Map
If your data consists of simple text, json and images that can be decoded by the standard [webdataset auto decoder](https://rom1504.github.io/webdataset/api/webdataset/autodecode.html),
and they map directly to the attributes of your chosen sample type from the list above, use a "field map".
The field map stores which file extension in the webdataset shall be mapped to which attribute of the sample class.
#### Sample Loader (Deprecated)
If your data needs some custom decoding code to compute the sample attributes from the data in the tar, you can use a custom sample loader.
However, starting from Energon 7, we recommend to use crude datasets and a [cooker](crude-data) instead.
If you use a `sample_loader.py`, its code shall only contain the dataset-specific decoding, no project-specific decoding.
Example for a special format (e.g. ocr dataset) for which we will use a custom `sample_loader.py`:
```text
parts
├── segs-000000.tar
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).mp
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0025).words.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).jp2
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).lines.png
│   ├── 636f6d706f6e656e747362656e6566693030616e6472(0075).mp
│   └── ...
└── ...
```
`.mp` (`msgpack` content) files are automatically decoded, containing:
```json
{
  "identifier": "componentsbenefi00andr",
  "pageno": 25,
  "size": {"w": 2286, "h": 3179},
  "lines": [
    {"l": 341, "t": 569, "b": 609, "r": 1974, "text": "CHAPTER 4 ADVANCED TRAFFIC CONTROL SYSTEMS IN INDIANA"},
    {"l": 401, "t": 770, "b": 815, "r": 2065, "text": "A variety of traffic control systems currently exist"},
    //...
  ],
  "words": [
    {"l": 341, "t": 577, "b": 609, "r": 544, "text": "CHAPTER"},
    {"l": 583, "t": 578, "b": 607, "r": 604, "text": "4"},
    //...
  ],
  "chars": [
    {"t": 579, "b": 609, "l": 341, "r": 363, "text": "C"},
    {"t": 579, "b": 609, "l": 370, "r": 395, "text": "H"},
    //...
  ],
}
```
`sample_loader.py`:
```python
import torch


def sample_loader(raw: dict) -> dict:
    return dict(
        __key__=raw["__key__"],
        image=raw["jp2"],
        text="\n".join(line["text"] for line in raw["mp"]["lines"]),
        lines_boxes=torch.tensor(
            [
                (line["l"], line["t"], line["r"] - line["l"], line["b"] - line["t"])
                for line in raw["mp"]["lines"]
            ],
            dtype=torch.int64,
        ),
        lines_text=[line["text"] for line in raw["mp"]["lines"]],
        words_boxes=torch.tensor(
            [
                (word["l"], word["t"], word["r"] - word["l"], word["b"] - word["t"])
                for word in raw["mp"]["words"]
            ],
            dtype=torch.int64,
        ),
        words_text=[word["text"] for word in raw["mp"]["words"]],
        chars_boxes=torch.tensor(
            [
                (char["l"], char["t"], char["r"] - char["l"], char["b"] - char["t"])
                for char in raw["mp"]["chars"]
            ],
            dtype=torch.int64,
        ),
        chars_text=[char["text"] for char in raw["mp"]["chars"]],
    )


def part_filter(part: str) -> bool:
    return part in ("jp2", "mp")
```
For more information please also read [](custom-sample-loader).
(data-on-disk)=
## Dataset Format on Disk (WebDataset)
The energon library supports loading large multi-modal datasets from disk.
To load the dataset, it must comply with the format described in this section unless it's a JSONL dataset.
A valid energon dataset must contain an `.nv-meta` folder with certain files as shown below.
```
my_dataset
├── .nv-meta
│   ├── dataset.yaml
│   ├── split.yaml
│   ├── .info.json
│   ├── index.sqlite
│   └── index.uuid
├── shards
│   ├── shard_000.tar
│   ├── shard_001.tar
│   └── ...
```
Note that the `shards` folder is just an example. The shards and their folder can be named differently, but the `.nv-meta` structure is always the same.
### Files in `.nv-meta`
#### dataset.yaml (user editable)
The `dataset.yaml` contains the dataset definition, i.e. the dataset class to use as the loader, and optional decoders.
If you want to create such a file, you should consider using the [CLI preparation tool](energon-prepare).
Here's an example:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: CaptioningSample
field_map:
  image: jpg
  caption: txt
```
For a crude dataset the `dataset.yaml` will simply be
```yaml
__module__: megatron.energon
__class__: CrudeWebdataset
```
The `__class__` and `__module__` values help the library construct the correct object.
The `field_map` specifies how the fields from each webdataset sample are mapped to the members of the sample dataclass.
In this example, the dataclass is
```python
@dataclass
class CaptioningSample(Sample):
    image: torch.Tensor
    caption: str
```
#### split.yaml (user editable)
This file contains the splits (i.e. train, val, test), each a list of the shards for each split.
It can also contain an "exclude list" to exclude certain samples or shards from training.
Example:
```yaml
exclude: []
split_parts:
  train:
    - shards/shard_000.tar
    - shards/shard_001.tar
  val:
    - shards/shard_002.tar
  test:
    - shards/shard_003.tar
```
To exclude certain shards or samples, you need to add those to the `exclude` list as follows:
```yaml
exclude:
  - shards/shard_004.tar
  - shards/shard_001.tar/000032
  - shards/shard_001.tar/000039
split_parts:
  ...
```
The above code excludes the entire shard `004` and two samples from the shard `001`.
#### .info.json (read-only)
The hidden info file is auto-generated and contains a list of all shards and the number of samples in each.
Example:
```json
{
  "energon_version": "7.1.0",
  "shard_counts": {
    "shards/000.tar": 1223,
    "shards/001.tar": 1420,
    "shards/002.tar": 1418,
    "shards/003.tar": 1358
  }
}
```
The order of tar files is important, as it's used by the sqlite database below.
#### index.sqlite and index.uuid (read-only)
The sqlite database was introduced in Energon 7 and allows for fully random access of samples and files by their names.
This is a precondition for polylithic datasets and for the [`energon mount`](energon-mount) command.
Below there is some detailed information for the interested reader. Note that the internal table structure can
change in any release without notice.
The database contains an entry for each sample and sample part including their byte offsets and sizes in the tar files.
Example `samples` table:
| tar_file_id | sample_key | sample_index | byte_offset | byte_size |
| --- | --- | --- | --- | --- |
| 0 | 00000 | 0 | 0 | 35840 |
| 0 | 00001 | 1 | 35840 | 35840 |
| 0 | 00002 | 2 | 71680 | 35840 |
| 0 | ... | | | |
The byte offsets describe the range around all the tar entries that are part of that sample including the tar headers.
Corresponding example `sample_parts` table:
| tar_file_id | sample_index | part_name | content_byte_offset | content_byte_size |
| --- | --- | --- | --- | --- |
| 0 | 0 | json | 1536 | 31 |
| 0 | 0 | png | 3584 | 30168 |
| 0 | 0 | txt | 35328 | 16 |
| 0 | 1 | json | 37376 | 31 |
| 0 | 1 | png | 39424 | 30168 |
| 0 | 1 | txt | 71168 | 16 |
| 0 | ... | | | |
The byte offsets in the `sample_parts` table refer to the byte ranges of the actual file content and can be used to
directly access the content without parsing the tar header.
Both tables can be joined over the `tar_file_id` and the `sample_index`. Note that the `tar_file_id` refers to the list
of tar files in the `.info.json` file.
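For illustration, a read-only peek at the index using Python's `sqlite3`, assuming the table layout above (which, as noted, may change between releases):
```python
import sqlite3

con = sqlite3.connect("my_dataset/.nv-meta/index.sqlite")
rows = con.execute(
    """
    SELECT s.sample_key, p.part_name, p.content_byte_offset, p.content_byte_size
    FROM samples AS s
    JOIN sample_parts AS p
      ON p.tar_file_id = s.tar_file_id AND p.sample_index = s.sample_index
    LIMIT 5
    """
).fetchall()
for sample_key, part_name, offset, size in rows:
    # The content offsets allow reading a part directly, without parsing tar headers
    print(sample_key, part_name, offset, size)
```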
(data-on-disk-jsonl)=
## Dataset Format on Disk for JSONL Datasets
For the simpler JSONL option, you will still need to run `energon prepare`, but this will not create a full `.nv-meta` folder.
Instead, only an index file with the same base filename will be created.
So if your dataset is named `my_dataset.jsonl`, a new file `my_dataset.jsonl.idx` will appear next to it when preparing it.
That's all. The dataset type will always be `CrudeWebdataset` and the split part is `train` by default. However, when loading the dataset
you can change the split type to `val` or `test`.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Glossary
* **Batch Grouping**
  * Allows you to programmatically decide which samples (out of a buffer) will be put into one batch. See [](../advanced/grouping.md).
* **Cooking**
  * Used to transform crude (raw) samples into a populated instance of a sample dataclass.
* **Crude Dataset**
  * An energon dataset that does not yield a readily-populated sample (instance of a dataclass), but a raw dict.
  * A cooker is used to handle this transformation in the user's custom task encoder. See [](crude-data).
* **Grouping**
  * See "Batch Grouping"
* **Monolithic Dataset**
  * The simple form of putting all your text and media data into the same WebDataset (see [](monolithic-dataset)).
  * The other option is to use a "Polylithic Dataset"
* **Packing**
  * In Energon, "packing" means "sequence packing". See "Sequence Packing" below.
* **Polylithic Dataset**
  * Used to separate the text-based data from the (usually larger) media data.
  * Each modality is put in its own dataset and one dataset can refer to the other by file names.
  * For more information see [](polylithic-dataset)
* **Sample**
  * In Energon, by sample we typically mean an instance of {py:class}`Sample <megatron.energon.Sample>` (i.e. one of its subclasses)
  * Sometimes we also call the source files that are inside the WebDataset and are used to create that dataclass instance a "sample"
    * For example, inside one tar file there may be `004.jpg` and `004.txt` (image and label) together forming a captioning sample
  * The {py:class}`Sample <megatron.energon.Sample>` dataclass has several mandatory and optional fields that describe one piece of training data for your ML workload. Typically it contains the input data to the model and the label data.
* **Sample Part**
  * A "sample part" is one of the components of a sample inside the WebDataset tar file. A captioning sample may be created from `004.jpg` and `004.txt`, and each of those files is a sample part. This sample with the *key* `004` has two *parts*, `txt` and `jpg`.
* **Sequence Packing**
  * A method to better utilize the available context length / sequence length of a model and reduce padding.
  * Explained in [](../advanced/packing.md)
* **Task Encoder**
  * An Energon-specific concept: The TaskEncoder is a user-defined class to customize the steps of the data flow pipeline.
  * See [](../basic/basics_flow.md) and [](../basic/task_encoder.md)
* **WebDataset**
  * A file format to store your dataset on disk, based on TAR files. See [https://github.com/webdataset/webdataset](https://github.com/webdataset/webdataset).
  * Energon's dataset format builds on WebDataset and extends it with additional files; see [](data-on-disk).
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Metadataset
Metadatasets allow combining datasets together in a variety of ways by using a yaml config file.
This is useful for example if you want to mix multiple datasets together, and especially if you want to reuse that combination.
To create a metadataset, you simply create a yaml file of the following format.
Example `demo-metadataset.yaml` file:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  # Split defines the train/val/test subsets composition
  train:
    # Mix the following datasets
    blend:
      - weight: 5
        path: ./coco
      - weight: 2
        path: ./coyo
      - weight: 1
        path: ./other
  val:
    # Only use coco-val for val
    path: ./coco
  test:
    # Only use coyo-test for test
    path: ./coyo
```
In the above example, we create a blend of three datasets. Out of the yielded training samples, 62.5% ({math}`=\frac{5}{8}`) will come from `./coco`, 25% from `./coyo` and 12.5% from `./other`.
Note that the relative paths in the metadataset are relative to the location of the metadataset file. Absolute paths are allowed but won't work for object storage.
To use the metadataset in your loader, simply load it with {py:func}`get_train_dataset <megatron.energon.get_train_dataset>` instead of a normal energon dataset:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'demo-metadataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
```
Here is another example that takes both the training and the validation set of coyo into the blended training data (with different weights though):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  # Split defines the train/val/test subsets composition
  train:
    # Mix the following datasets
    blend:
      - weight: 5
        path: ./coco
      - weight: 2
        path: ./coyo
        split_part: train
      - weight: 1
        path: ./coyo
        split_part: val  # <-- Takes the val set of coyo into the train split
  val:
    # Only use coco-val for val
    path: ./coco
  test:
    # Only use coyo-test for test
    path: ./coyo
```
Actually, `split_part: train` is the default, so there's no need to specify it explicitly.
When referring to datasets under `val:`, `split_part: val` is the default accordingly.
Energon also supports blending by specifying the number of repetitions for each dataset using [Epochized Blending](../advanced/epochized_blending).
(sect-subflavors)=
## Subflavors
Subflavors are a way to *tag* samples that come from different origins so that they can still be differentiated after blending.
Even when blending many datasets together, you might want to handle some of them differently in your [Task Encoder](task_encoder).
For example when doing OCR, you might have one dataset with full pages of text and one with only paragraphs. In your task encoder you could decide to augment the images differently.
Here is a modified example of the above `metadataset.yaml` config file that adds some subflavors:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  # Split defines the train/val/test subsets composition
  train:
    # Blend the following datasets
    blend:
      - weight: 5
        path: ./coco
        # Set the __subflavors__ property of the samples
        subflavors:
          augmentation_type: small_images
          text_length: short
      # Combine coyo-train and coyo-val
      - weight: 2
        path: ./coyo
        split_part: train
        # Set the __subflavors__ property of the samples
        subflavors:
          augmentation_type: large_images
          text_length: short
      - weight: 1
        path: ./coyo
        split_part: val
        # Set the __subflavors__ property of the samples
        subflavors:
          augmentation_type: large_images
          text_length: short
  # For val and test, blending will actually concatenate the datasets
  val:
    # Only use coco val for val
    path: ./coco
    subflavors:
      augmentation_type: small_images
      text_length: short
  test:
    path: ./coyo
```
In the above example, the coco training samples will now have the subflavor `augmentation_type` set to `small_images`, while the samples from coyo will have that property set to `large_images`.
Note that subflavors are entirely custom and you can use any name and any value for them, for example `foo: bar`.
In the code they will be passed around as a dictionary.
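For example, a task encoder could branch on them like this (a sketch; the augmentation helpers are hypothetical):
```python
from megatron.energon import DefaultTaskEncoder, CaptioningSample

class MyTaskEncoder(DefaultTaskEncoder):
    def encode_sample(self, sample: CaptioningSample) -> CaptioningSample:
        # __subflavors__ is the plain dict set in the metadataset yaml
        subflavors = sample.__subflavors__ or {}
        if subflavors.get("augmentation_type") == "small_images":
            sample.image = augment_small(sample.image)  # hypothetical helper
        else:
            sample.image = augment_large(sample.image)  # hypothetical helper
        return sample
```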
## Auxiliary Data
In the metadataset, you can also specify auxiliary data for each blended dataset.
Please check out [Auxiliary Data](aux-data), for more information.
## Classes
* {py:class}`DatasetLoaderInterface <megatron.energon.DatasetLoaderInterface>`: Common interface for dataset loaders. Provides methods for constructing/loading the actual train- or val-mode dataset.
* {py:class}`MetadatasetV2 <megatron.energon.MetadatasetV2>`: The metadataset loader using the yaml example above. Blends datasets for train-mode, and concatenates for val-mode.
* {py:class}`DatasetLoader <megatron.energon.DatasetLoader>`: The dataset loader using a dataprepped folder (containing `.nv-meta` folder).
## Functions
* {py:func}`get_train_dataset <megatron.energon.get_train_dataset>`: Returns the train-mode (meta)dataset.
* {py:func}`get_val_dataset <megatron.energon.get_val_dataset>`: Returns the val-mode (meta)dataset.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Quickstart
You can use Megatron Energon to load datasets in the energon dataset format. This format is essentially [WebDataset](https://github.com/webdataset/webdataset) with some meta data added.
Since version 7.2.0, we also support [JSONL](create-jsonl-dataset) as a simpler format with fewer features.
For a moment let's assume you already have prepared a dataset in the needed format, and it's stored on
disk at `/my/dataset/path`. If you want to create a dataset now, check out [](data_prep).
If you simply want some dummy data to try this out, check out the unit test method `create_captioning_test_dataset` inside `tests/test_dataset.py`.
```{admonition} Good to know
:class: tip
You can also store your dataset inside an S3-compatible object store and load it from there! See [](../advanced/remote_dataset)
```
You can then load the dataset like this:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

loader = get_loader(ds)

for batch in loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
```
First, we call {py:meth}`get_train_dataset <megatron.energon.get_train_dataset>` (click to see the signature).
The method will check what kind of dataset is on disk and instantiate the correct class for it.
A worker configuration is always needed to specify how the work is distributed across multiple ranks and workers.
In this simple example, we use a helper method {py:meth}`default_worker_config <megatron.energon.WorkerConfig.default_worker_config>` to get reasonable default values.
The dataset should not be iterated directly, but used with a loader which handles the worker processes.
The batches will contain samples of the sample type specified in the [task encoder](task_encoder).
```{admonition} Good to know
:class: tip
Since we did not specify a task encoder above, the {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>` will be used.
It will not transform the data. For batching it will use common sense magic to pad and stack tensors or build lists if the type is unknown.
```
_Wait. Why does the dataset create batches? Shouldn't the dataloader do that?_
Energon will create batches at dataset level.
Internally, most of the cool things that energon can do (such as blending datasets together, [sequence packing](../advanced/packing), etc.)
are dataset wrappers. Even the process of batching is such a wrapper and the default {py:meth}`get_train_dataset <megatron.energon.get_train_dataset>`
function will construct a suitable combination of all these based on the arguments you pass to that function.
Check out the [](basics_flow) section to see the steps in which the data is processed.
_Why must `shuffle_buffer_size` and `max_samples_per_sequence` be set explicitly?_
As the library is designed to work on (sequential) webdatasets but still wants to provide proper shuffling, these parameters are required. To make sure the user does not forget to set them, we enforce that they are set explicitly.
A value of 100 for both settings seems to work well for image datasets (i.e. a good balance of shuffling randomness vs. seeking performance impact), but datasets whose samples are much larger or smaller might require different settings.
Setting the sequence length to a very small size compared to the number of samples in the dataset will result in more random access, thus slowing down data loading, so the recommendation is to set it to a high enough value.
At the same time, a high value reduces the shuffling randomness, which requires a larger shuffle buffer size to compensate (i.e. a higher memory footprint and longer state restore times).
## Tutorial 1: Preparing the Dataset
For running any tutorials, you'll need your data structured as a webdataset. For more details and more options, check out [](data_prep).
For example, in your folder containing the tar files, run
```sh
$ energon prepare ./
# Example answers to interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
image: jpg
caption: txt # if txt contains the caption
# or
caption: json[caption] # if .json contains {"caption": "My nice image"}
```
## Tutorial 2: Loading a Dataset
Let's be a bit more concrete and try out the above data loading code with a real dataset.
We are going to print the first batch and stop.
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/path/to/your/dataset',
    batch_size=None,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

loader = get_loader(ds)

for batch in loader:
    print(batch)
    break
```
This prints
```text
CaptioningSample(__key__='part_00123/00403.tar/004030195', image=tensor([[[0.4549, 0.4784, 0.5059, ..., 0.8392, 0.8275, 0.7961],
[0.4549, 0.4784, 0.5020, ..., 0.6431, 0.6275, 0.5882],
[0.4510, 0.4706, 0.4941, ..., 0.6235, 0.6353, 0.6078],
...,
[0.4471, 0.4196, 0.4510, ..., 0.8471, 0.8039, 0.8275],
[0.4667, 0.4353, 0.4667, ..., 0.8196, 0.7804, 0.8078],
[0.4824, 0.4549, 0.4824, ..., 0.8196, 0.7843, 0.8118]],
[[0.3608, 0.3843, 0.4118, ..., 0.7373, 0.7255, 0.6941],
[0.3608, 0.3843, 0.4078, ..., 0.5412, 0.5255, 0.4863],
[0.3569, 0.3765, 0.4000, ..., 0.5098, 0.5216, 0.4941],
...,
[0.3608, 0.3333, 0.3647, ..., 0.7529, 0.7098, 0.7333],
[0.3804, 0.3490, 0.3804, ..., 0.7255, 0.6863, 0.7137],
[0.3961, 0.3686, 0.3961, ..., 0.7255, 0.6902, 0.7176]],
[[0.2510, 0.2745, 0.3020, ..., 0.6000, 0.5882, 0.5569],
[0.2510, 0.2745, 0.2980, ..., 0.4039, 0.3882, 0.3490],
[0.2471, 0.2667, 0.2902, ..., 0.3765, 0.3882, 0.3608],
...,
[0.2667, 0.2392, 0.2706, ..., 0.6510, 0.6000, 0.6235],
[0.2863, 0.2549, 0.2863, ..., 0.6235, 0.5765, 0.6039],
[0.3020, 0.2745, 0.3020, ..., 0.6235, 0.5882, 0.6157]]]), caption='Cello Renting vs. Buying: Which is Right for You?')
```
Awesome, it returns a {py:class}`CaptioningSample <megatron.energon.CaptioningSample>` with the attributes:
- `__key__`: `part_00123/00403.tar/004030195`, the identifier of the sample in the form `TAR_FILE/INDEX`
  - All sample types have a key; it's defined in the base class {py:class}`Sample <megatron.energon.Sample>`
- `image`: The image as a tensor of shape `(3, 267, 400)` (an RGB image; no batch dimension, since we set `batch_size=None`)
- `caption`: The caption string
Let's also talk about the {py:class}`WorkerConfig <megatron.energon.WorkerConfig>`. As energon is made for distributed training,
you always need to provide a worker config to the dataset to specify how many ranks and workers there are and which rank you're currently on.
For this simple tutorial, we don't really distribute the work, so we use only a single rank with 4 workers. Check out the helper method {py:meth}`default_worker_config <megatron.energon.WorkerConfig.default_worker_config>` to see how the worker config is constructed. Also don't be afraid to click the *`[source]`* link and look at its very short source code.
## Tutorial 3: Batch Size
Actually, we would like to use a `batch_size` of more than one; let's go with 2 for now.
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    '/path/to/your/dataset',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
```
The output will be similar to above but with different shapes and lengths:
- `batch.__key__`: A list of two keys
- `batch.image`: Tensor of shape `(2, 3, 267, 400)`
- `batch.caption`: A list of two caption strings
The default [task encoder](task_encoder) automagically padded and stacked the items to a batch.
This may be ok for some cases, but usually you will want to process and batch your data differently.
Hence, we can
- either use an existing task encoder
- or define a custom one (see [](task_encoder))
## Tutorial 4: Blending using Metadataset
A typical use case is to blend multiple datasets of the same (or a similar) type together.
For example, you may want to blend the COCO dataset with the COYO dataset.
The easiest way to do this, is to use the metadataset pattern.
For this you need to create a new `yaml` file that defines the meta dataset:
`coyo-coco-dataset.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  # Train dataset, the datasets will be blended according to their weights
  train:
    blend:
      - weight: 5
        path: ./coco
      - weight: 2
        path: ./coyo
  # For val and test, datasets will be concatenated
  val:
    path: ./coco
  test:
    path: ./coyo
```
This assumes that the datasets `coyo` and `coco` exist in subfolders next to the `coyo-coco-dataset.yaml` file. You could also use absolute paths, but that will not work well when using object storage such as S3.
To use it in your loader, simply pass the yaml file to {py:func}`get_train_dataset <megatron.energon.get_train_dataset>`:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
```
If you need to handle samples from different datasets differently in your pipeline, you will want to use `subflavors`.
For these and other details, check out the [](metadataset) section. Energon also supports blending by specifying the number of repetitions for each dataset using [](../advanced/epochized_blending).
## Tutorial 5: Distributed Loading
For multi-GPU support, you may need to adapt the worker config.
So far we have only used the default worker config, which you can get by calling {py:func}`WorkerConfig.default_worker_config() <megatron.energon.WorkerConfig.default_worker_config>`.
This default config tries to infer your multi-GPU setup by using `torch.distributed`, which is fine in most cases.
If you are not using any distributed setup, the default config will work, too. In that case, it assumes a single local rank.
However, if you have a more complex multi-node setup with other non-data-parallel strategies, you may need to set it up yourself.
The following example shows how it could be set.
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

worker_config = WorkerConfig(
    rank=SET_YOUR_GLOBAL_DATA_RANK_HERE,
    world_size=SET_YOUR_GLOBAL_WORLD_SIZE_HERE,
    num_workers=2,
)

loader = get_loader(get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
))

for batch in loader:
    print(batch)
    break
```
For a detailed discussion on parallelism, check out [](../advanced/parallelism).
## Tutorial 6: Saving and Restoring
For saving and restoring the state (e.g. when pausing and resuming training), the loader must be instantiated with the savable loader.
```python
from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig

ds = get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

# Must use the savable loader here. It provides methods to save
# and restore the state of the data loader.
loader = get_savable_loader(ds)

# Iterate for some steps
for i, batch in zip(range(10), loader):
    print(batch)

# Save the state
state = loader.save_state_rank()
# You could now persist the state, e.g. using torch.save()

# ... when loading:
# You could load the persisted state, e.g. with torch.load()

# Restore the state for a new loader
ds = get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)
loader = get_savable_loader(ds)
loader.restore_state_rank(state)
```
We provide code for different scenarios of saving and loading in distributed settings in the section [](save_restore).
## More Features
Check out the topics in Advanced Usage for details on specific features.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Save and Restore
For long-running training jobs, you will usually need to stop and resume the training including the data loader.
One of energon's unique features is its deterministic save and restore capability.
At any iteration, you'll be able to store the overall state of the data loader across all ranks and accurately resume it later on, to continue where it left off.
Below, we list a few different ways to achieve that.
## Scenarios
All these scenarios work in distributed and non-distributed settings.
For simplicity, we are assuming the same loader and data for all scenarios, as in [](../basic/quickstart):
```python
from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig

worker_config = WorkerConfig.default_worker_config()

def get_my_loader():
    return get_savable_loader(get_train_dataset(
        'coyo-coco-dataset.yaml',
        batch_size=4,
        shuffle_buffer_size=100,
        max_samples_per_sequence=100,
        worker_config=worker_config,
    ))
```
### 1. Save/Restore the State per Rank Separately
In this scenario, each rank saves and restores its own state in an independent file.
This is our recommended way, since it avoids transferring the data across ranks.
```python
import torch

# Saving the state
loader = get_my_loader()

# Iterate for some steps
for i, batch in zip(range(10), loader):
    print(batch)

# Save the state on this rank
state = loader.save_state_rank()

# In this example, save the state using `torch.save`; this can of course be custom
torch.save(state, f'dataloader_state_rank{worker_config.rank}.pth')
```
```python
import torch

# Restoring the state
loader = get_my_loader()

# Now, when restoring the state:
state = torch.load(f'dataloader_state_rank{worker_config.rank}.pth')

# Restore the state for the loader on each rank separately
loader.restore_state_rank(state)
```
### 2. Save/Restore the State on the Primary Rank Only
In this scenario, the primary rank (usually rank 0) is responsible for saving the state.
All ranks' states are collected (gathered) by one rank and can be stored in one file.
When restoring, the state is scattered from the primary rank to all other ranks.
This approach centralizes the state management, which simplifies the process and reduces the number of files stored.
```python
import torch

# Saving the state
loader = get_my_loader()

# Iterate for some steps
for i, batch in zip(range(10), loader):
    print(batch)

# Gather the state of all ranks on primary rank 0
state = loader.save_state_global(dst_rank=0)
if worker_config.rank == 0:
    # Only rank 0 has the state now; for all other ranks, the state is None
    # In this example, save the state using `torch.save`; this can of course be custom
    torch.save(state, 'dataloader_state.pth')
```
```python
import torch

# Restoring the state
loader = get_my_loader()

# Load the state only on the primary rank
if worker_config.rank == 0:
    state = torch.load('dataloader_state.pth')
else:
    state = None

# Restore the state for the loader, broadcasting from rank 0
loader.restore_state_global(state, src_rank=0)
```
```{admonition} Note
:class: important
Even though only one rank collects the states, all ranks need to execute the `loader.save_state_global()` and `loader.restore_state_global()` calls.
```
### 3. Save the State on the Primary Rank, Restore on Ranks Separately
In this scenario, the primary rank saves the state, but each rank restores the state separately. Each rank loads all saved states and selects the correct one. This approach combines centralized saving with distributed restoring and is rather uncommon.
Depending on the framework used for training, that framework may already handle the scattering/gathering of the states. In that case, refer to the first scenario using `save_state_rank`/`restore_state_rank`.
```python
import torch

# Saving the state
loader = get_my_loader()

# Iterate for some steps
for i, batch in zip(range(10), loader):
    print(batch)

# Gather the state of all ranks on primary rank 0
state = loader.save_state_global(dst_rank=0)
if worker_config.rank == 0:
    # In this example, save the state using `torch.save`; this can of course be custom
    torch.save(state, 'dataloader_state.pth')
```
```python
import torch

# Restoring the state
loader = get_my_loader()

# Load the full checkpoint on all ranks
state = torch.load('dataloader_state.pth')

# Restore the state for the loader on the current rank, using the all-ranks checkpoint
loader.restore_state_global(state, src_rank=None)
```
## Summary
In each of these scenarios, ensure that the logic for saving and restoring the state is appropriately synchronized across ranks to maintain consistency.
If you encounter torch distributed errors, the distributed calls are likely out of sync, or not all ranks executed them. If unsure, debug using the first scenario, saving each rank separately.
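As a sketch, a synchronized checkpoint step for the second scenario could look like this (assuming `torch.distributed` is already initialized and using `get_my_loader` from above; the final barrier is optional):
```python
import torch
import torch.distributed as dist

loader = get_my_loader()
for i, batch in zip(range(10), loader):
    pass  # training step

# All ranks must call save_state_global together, even though only rank 0 receives the state
state = loader.save_state_global(dst_rank=0)
if dist.get_rank() == 0:
    torch.save(state, 'dataloader_state.pth')
# Optional: make sure no rank continues before the checkpoint is written
dist.barrier()
```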
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Task Encoder
The "Task Encoder" is an Energon-specific concept. It is a class that describes how each sample is going to be processed at different stages of the pipeline.
Please also take a look at [](basics_flow) to see the pipeline.
If you don't specify any task encoder, a default version, the {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>` will be used.
For writing your own task encoder, create a class based on {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>`
and override one or more of the following methods. The data flow of {py:func}`get_train_dataset <megatron.energon.get_train_dataset>` or {py:func}`get_val_dataset <megatron.energon.get_val_dataset>` is as follows:
- {py:meth}`def cook_crude_sample(self, sample: Union[T_sample, CrudeSample]) -> T_sample <megatron.energon.TaskEncoder.cook_crude_sample>`
- Optional. Define when using [crude data](crude-data).
- {py:func}`def encode_sample(self, sample: T_sample) -> T_encoded_sample <megatron.energon.DefaultTaskEncoder.encode_sample>`
- Transform the raw data from the dataset (e.g. augment/transform images, tokenize a single sample).
- Alternatively, you can use {py:meth}`preencode_sample <megatron.energon.TaskEncoder.preencode_sample>` and {py:meth}`postencode_sample <megatron.energon.TaskEncoder.postencode_sample>`. This is especially useful when you use packing and you want to defer loading to when the packed samples are known.
- {py:meth}`def preencode_sample(self, sample: T_sample) -> T_sample <megatron.energon.TaskEncoder.preencode_sample>`
- Can be used as a replacement for {py:func}`encode_sample <megatron.energon.DefaultTaskEncoder.encode_sample>`
- Only used together with {py:meth}`postencode_sample <megatron.energon.TaskEncoder.postencode_sample>` (below). Use it if you need to instantiate lazy data before packing the selected samples. But it will also work when not using packing.
- {py:meth}`def select_samples_to_pack(self, samples: List[T_encoded_sample]) -> List[List[T_encoded_sample]] <megatron.energon.TaskEncoder.select_samples_to_pack>`
- Optional. Allows for efficient sample packing. See [](../advanced/packing).
- {py:meth}`def postencode_sample(self, sample: T_sample) -> T_encoded_sample <megatron.energon.TaskEncoder.postencode_sample>`
- Only used together with {py:meth}`preencode_sample <megatron.energon.TaskEncoder.preencode_sample>`. Use it if you need to instantiate lazy data before packing the selected samples. But it will also work when not using packing.
- {py:meth}`def pack_selected_samples(self, samples: List[T_encoded_sample]) -> T_batch_sample <megatron.energon.TaskEncoder.pack_selected_samples>`
- Required if `select_samples_to_pack` is used. Compresses a group of samples into a single sample.
- (samples are collected for a batch)
- {py:meth}`def batch(self, batch: List[T_encoded_sample]) -> T_raw_batch <megatron.energon.DefaultTaskEncoder.batch>`
- Collate the batch to a single sample, defaults to padded batching for tensors, lists for everything else.
- {py:meth}`def encode_batch(self, batch_data: T_raw_batch) -> T_batch <megatron.energon.DefaultTaskEncoder.encode_batch>`
- Transform the batched data (e.g. tokenize the whole batch).
- (optionally limit the dataset size, based on the `limit` argument)
- (optionally epochize the dataset)
- (move data from the worker to the main process through the {py:class}`torch.utils.data.DataLoader` via {py:func}`get_loader <megatron.energon.get_loader>`)
- For batches based on {py:class}`Batch <megatron.energon.Batch>`, call {py:meth}`def pin_memory(self, batch: T_batch) -> T_batch <megatron.energon.Batch.pin_memory>`, or if not a dataclass, use default torch pinning (this must happen in the main process, thus after data loading)
If a sample or batch is to be ignored, any of these methods may raise {py:class}`IgnoreSample <megatron.energon.IgnoreSample>` to skip the sample being processed.
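A minimal sketch of skipping unwanted samples this way (assuming a captioning sample with a `caption` string field):
```python
from megatron.energon import CaptioningSample, DefaultTaskEncoder, IgnoreSample


class FilteringTaskEncoder(DefaultTaskEncoder):
    def encode_sample(self, sample: CaptioningSample) -> CaptioningSample:
        # Skip samples with an empty caption; the loader continues with the next sample
        if not sample.caption.strip():
            raise IgnoreSample()
        return sample
```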
The types `T_sample`, `T_encoded_sample`, `T_raw_batch` and `T_batch` are generics and depend on your task. You do not necessarily have to specify them; they are only used for proper typing in your IDE.
```python
from dataclasses import dataclass
from typing import Callable, List, Optional

import torch

from megatron.energon import Batch, CaptioningSample, DefaultTaskEncoder, SampleDecoder


# Type for the intermediate batch, after the batching operation
@dataclass
class CaptioningRawBatch(Batch):
    # (n, c, h, w)
    image: torch.Tensor
    # (n,)
    caption: List[str]


# Typing for the resulting batch data
@dataclass
class CaptioningBatch(Batch):
    # (n, c, h, w)
    images: torch.Tensor
    # (n, c)
    text_tokens: torch.Tensor
    # (n, c, c)
    text_attn_mask: torch.Tensor


# All the typing is optional
class CaptioningTaskEncoder(
    DefaultTaskEncoder[CaptioningSample, CaptioningSample, CaptioningRawBatch, CaptioningBatch]
):
    """A simple task encoder for captioning."""

    decoder = SampleDecoder(image_decode="torchrgb")

    def __init__(
        self,
        tokenizer: "Tokenizer",  # e.g. a tokenizer from `transformers`
        image_transform: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,
        max_length: int = 128,
    ):
        # Specify the batch_type for default batching (here, batching is performed "manually" by
        # overwriting the `batch` method)
        super().__init__(batch_type=CaptioningRawBatch)
        self.tokenizer = tokenizer
        self.image_transform = image_transform
        self.max_length = max_length

    def encode_sample(self, sample: CaptioningSample) -> CaptioningSample:
        sample.image = self.image_transform(sample.image)
        return sample

    def batch(self, samples: List[CaptioningSample]) -> CaptioningRawBatch:
        # Batch the samples. By default, `batch_pad_stack` is used for all tensor fields, and
        # `batch_list` is used for all non-tensor fields. This matches the default implementation
        # (i.e. the same behavior as not overwriting the `batch` method).
        return CaptioningRawBatch.from_samples(samples)

    def encode_batch(self, batch_data: CaptioningRawBatch) -> CaptioningBatch:
        # Run the tokenizer on the batch of captions
        tokenized = self.tokenizer(batch_data.caption)
        # Return the final batch, going into the network
        return CaptioningBatch.derive_from(
            batch_data,
            images=batch_data.image,
            text_tokens=tokenized["input_ids"],
            text_attn_mask=tokenized["attention_mask"],
        )
```
If you're wondering about the `decoder` assignment, check out [](../basic/data_decoding).
Usage in your training script:
```python
from torchvision import transforms
from transformers import AutoTokenizer

from megatron.energon import get_loader, get_train_dataset

train_img_transform = transforms.Compose(
    [
        transforms.RandomResizedCrop((224, 224)),
        transforms.RandomHorizontalFlip(),
    ]
)

train_loader = get_loader(get_train_dataset(
    '/my/dataset/path',
    batch_size=32,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    task_encoder=CaptioningTaskEncoder(
        tokenizer=AutoTokenizer.from_pretrained('gpt2'),
        image_transform=train_img_transform,
    ),
))

for data in train_loader:
    # data is a CaptioningBatch
    pass
```
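The example above does not use packing. Continuing it, a rough sketch of the two packing hooks could look like this (the greedy grouping and the 1024-character budget are purely illustrative, not the definitive usage; see [](../advanced/packing) for details):
```python
from typing import List


class PackingCaptioningTaskEncoder(CaptioningTaskEncoder):
    def select_samples_to_pack(
        self, samples: List[CaptioningSample]
    ) -> List[List[CaptioningSample]]:
        # Greedily group samples so that each group stays below a caption-length budget
        groups, current, current_len = [], [], 0
        for sample in samples:
            if current and current_len + len(sample.caption) > 1024:
                groups.append(current)
                current, current_len = [], 0
            current.append(sample)
            current_len += len(sample.caption)
        if current:
            groups.append(current)
        return groups

    def pack_selected_samples(self, samples: List[CaptioningSample]) -> CaptioningSample:
        # Compress one group into a single sample, here by simply concatenating the captions
        packed = samples[0]
        packed.caption = " ".join(s.caption for s in samples)
        return packed
```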
# Copyright (c) 2025, NVIDIA CORPORATION.
# SPDX-License-Identifier: BSD-3-Clause
# -*- coding: utf-8 -*-
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
# src_folder = pathlib.Path(__file__).parents[2]
# print("Copying README.md to docs/source/development_setup.md")
# shutil.copyfile(
# str(src_folder / ".." / "README.md"), str(src_folder / "docs" / "source" / "development_setup.md")
# )
# Add path to energon module
sys.path.insert(0, os.path.abspath("../../src"))
# -- Project information -----------------------------------------------------
project = "megatron-energon"
copyright = "2025 NVIDIA Corporation"
author = "Lukas Voegtle, Philipp Fischer"
# The short X.Y version
version = ""
# The full version, including alpha/beta/rc tags
release = ""
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.viewcode",
    "sphinx.ext.mathjax",
    "sphinx.ext.napoleon",
    "myst_parser",  # markdown (*.md) parser
    "sphinx_click",
]
# Autodoc
autodoc_mock_imports = [
    "braceexpand",
    "fsspec",
    # "torch",
    "webdataset",
    "tqdm",
    "numpy",
    "PIL",
    "s3fs",
]
autodoc_typehints = "description"
autodoc_typehints_format = "short"
# Mock everything here, so that not just autodoc, but also sphinx_click can make use of the mock imports
from sphinx.ext.autodoc.mock import MockFinder
sys.meta_path.insert(0, MockFinder(autodoc_mock_imports))
# Napoleon
napoleon_google_docstring = True
napoleon_numpy_docstring = True
napoleon_include_init_with_doc = True
napoleon_include_private_with_doc = False
napoleon_include_special_with_doc = False
napoleon_use_admonition_for_examples = False
napoleon_use_admonition_for_notes = False
napoleon_use_admonition_for_references = False
napoleon_use_ivar = False
napoleon_use_param = True
napoleon_use_rtype = True
napoleon_use_keyword = True
napoleon_custom_sections = None
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
# source_suffix = ".rst"
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}
# The master toctree document.
master_doc = "index"
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = "en"
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
    "prev_next_buttons_location": "both",
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_css_files = ["css/custom.css"]
# Favicon configuration
html_favicon = "_static/favicon.ico"
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = "energondoc"
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',
    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',
    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',
    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = []
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, "megatron-energon", "Megatron-Energon Documentation", [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = []
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ["search.html"]
# -- Extension configuration -------------------------------------------------
<!--- draw.io source of the data flow diagram (images/data_flow.png). It shows the pipeline:
loading the dataset via get_dataset(), cooking/loading samples (field_map, sample_loader.py or a
cooker), blending with other datasets, the shuffle buffer, and the TaskEncoder stages
encode_sample()/preencode_sample(), select_samples_to_pack(), postencode_sample(),
pack_selected_samples(), batch_group_criterion(), batch(), and encode_batch(), ending in
model.forward(). -->
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Megatron-Energon Documentation
This is the documentation of Megatron's multi-modal data loader "Energon".
We recommend getting started in the [Introduction](intro/introduction) section, which explains what Energon is and how to install it.
Once installed, check out the **Basic Usage** section starting with [Quickstart](basic/quickstart) for some basic examples and tutorials.
Some underlying concepts are explained in the rest of that section.
For specific use cases and advanced usage, please read **Advanced Usage**.
At the end, you will also find documentation on how to interface with energon programmatically and how to contribute to the code base.
```{toctree}
---
caption: Introduction
maxdepth: 2
---
intro/introduction
intro/installation
```
```{toctree}
---
caption: Basic Usage
maxdepth: 2
---
basic/quickstart
basic/data_prep
basic/data_decoding
basic/basics_flow
basic/task_encoder
basic/metadataset
basic/save_restore
basic/glossary
```
```{toctree}
---
caption: Advanced Usage
maxdepth: 2
---
advanced/remote_dataset
advanced/crude_datasets
advanced/custom_sample_loader
advanced/repro_scaling
advanced/packing
advanced/grouping
advanced/joining_datasets
advanced/subsets
advanced/epochized_blending
advanced/custom_blending
advanced/parallelism
```
```{toctree}
---
caption: API
maxdepth: 2
---
api/modules
api/cli
```
```{toctree}
---
caption: Internals
maxdepth: 2
---
internals/contrib_guidelines
internals/code_structure
```
# Indices and tables
- [](genindex)
- [](modindex)
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Code Structure
This section is meant to provide an introduction to Megatron Energon for developers who want to contribute to energon itself.
For now, this is still a placeholder and we encourage you to get in touch with us for an introduction.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Contribution Guidelines
If you want to contribute to this repository, please adhere to the following guidelines:
- Always use [black](https://pypi.org/project/black/) and [isort](https://pycqa.github.io/isort/) to format your code before committing (see the example commands after this list)
- Check that all license headers are present using `python3 scripts/license_headers.py --fix .`
- Python `@dataclass` and `NamedTuple` are preferred over dictionaries, which don't allow for IDE
auto-completion and type checking
- User-exposed classes and methods should be documented in Google-style docstrings that are parsed by sphinx
and end up in this documentation
- Breaking changes should be marked in the message of pull requests:
- `CHECKPOINT BREAKING CHANGE`: When the save/restore structure changed incompatibly (check test `test_metadataset:TestDataset.test_save_restore_state_train`)
- `ITERATION ORDER BREAKING CHANGE`: When the order of iterating samples changed, i.e. experiments would not be exactly reproducible (check tests `test_dataset:TestDataset.test_current_batch_index_generator`, `test_dataset:TestDataset.test_current_batch_index`, maybe more)
- `API BREAKING CHANGE`: When the external programming api changed incompatibly
- `DATASET CONFIG BREAKING CHANGE`: When the dataset config (`.nv-meta` folder) changed incompatibly
- `METADATASET CONFIG BREAKING CHANGE`: When the metadataset config changed
- In a release, all breaking changes except checkpoint-breaking changes lead to a new major version.
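As an example, a typical formatting pass from the repository root could look like this (assuming `black` and `isort` are installed in your environment):
```shell
black .
isort .
python3 scripts/license_headers.py --fix .
```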
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Installation
If you simply want to use this package without modifying it, the best option is to install it
as a dependency of your project like you would with any other pip package.
## Normal Installation
To install the most recent release version, run
```shell
pip install megatron-energon
```
in your project's Python environment, which could be a virtualenv, or a conda environment.
You can even install it inside a `Dockerfile` to include it in your custom docker container.
If you want to use [remote datasets](../advanced/remote_dataset) or [audio/video decoding](av-decoder), you
need to provide *extras* to the installation command, for example:
```shell
pip install megatron-energon[s3,av_decode]
```
For all available extras, check out the above links and the `pyproject.toml` file.
## Installation for Development
If you want to manage, debug or modify the code of energon itself, we recommend that you clone this repository
on your disk.
You can then install the package in **editable** mode.
This way, you can use energon and its CLI scripts while still being able to modify the source code.
First, check out the repository locally:
```shell
git clone https://github.com/NVIDIA/Megatron-Energon.git megatron-energon
```
Then install with your favorite tooling:
### Editable installation with uv and just
* `uv` is a fast modern tool that can replace legacy tools like pip, conda and virtualenv.
* `just` is a command runner that simplifies common tasks using the `justfile` we provide.
Check out the [official website](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer) on how to install `uv`.
On [this page](https://github.com/casey/just?tab=readme-ov-file#packages) you can find out how to install `just`.
Then, to set up a `.venv` and install energon in editable mode:
```shell
cd megatron-energon
just dev-sync
```
The `dev-sync` command will setup a local virtual environment in `.venv` and install all dependencies.
It will also install energon in editable mode for development inside that venv.
Activate the environment:
```shell
. .venv/bin/activate
```
Now you can call the `energon` command.
You can also use `just` to do a bunch of other things shown below.
Note that you don't need to activate the venv before running those.
```shell
# Run all unit tests
just test
# Run the code linter and format check
just check
# Build the documentation
just docs
# Show all available commands
just help
```
### Editable installation with pip
First make sure you are in some python environment where you want to set up energon.
Then install in development mode:
```shell
pip install -e ./megatron-energon
```
```{warning}
**We discourage importing the cloned repo without pip install**
- You will not be able to use the command line tool
- You would have to use hacks to get the package into your `PYTHONPATH`
- You would need to take care of the dependencies yourself.
Instead, simply install in development mode.
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# General
Megatron-Energon is a data loader that works best with your [Megatron](https://github.com/NVIDIA/Megatron-LM) project.
However, you can use it in any of your PyTorch-based deep learning projects.
What can it offer compared to other data loaders?
The most important features are:
* Comes with a standardized WebDataset-based format on disk
* Optimized for high-speed multi-rank training
* Can handle very large datasets
* Can easily mix and blend multiple datasets
* Its state is savable and restorable (deterministic resumability)
* Handles various kinds of multi-modal data even in one training run
Energon also comes with a command line tool that you can use to prepare your datasets.
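As a sketch, preparing a dataset could look like this (the `energon prepare` subcommand and path are illustrative; check `energon --help` for the exact options):
```shell
# Interactively create the dataset metadata for an existing WebDataset on disk
energon prepare /path/to/my/dataset
```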
# https://github.com/casey/just

# List justfile recipes
help:
    just --list

# Update the environment with the latest version of the dependencies
dev-sync:
    uv sync --all-extras --cache-dir .uv_cache

# Update the environment but not with the development dependencies
prod-sync:
    uv sync --all-extras --no-dev --cache-dir .uv_cache

# Fix the code style and format
fix: dev-sync
    uv run ruff check --fix
    uv run ruff format
    uv run scripts/license_headers.py src --fix
    uv run scripts/license_headers.py tests --fix

# Execute the ruff code linter and format checker
check: dev-sync
    uv run ruff check

# Execute all unit tests
test: dev-sync
    uv run -m unittest discover -v -s tests

# Build the docs
docs: dev-sync
    uv run sphinx-build -b html docs/source docs/build

# Build the release package
build: dev-sync
    rm -rf dist
    uv build --wheel
    uv build --sdist