{"name":"Megatron Energon Dataloader Documentation","short_name":"Megatron Energon","icons":[{"src":"/android-chrome-192x192.png","sizes":"192x192","type":"image/png"},{"src":"/android-chrome-512x512.png","sizes":"512x512","type":"image/png"}],"theme_color":"#ED467A","background_color":"#411046","display":"standalone"}
<link rel="apple-touch-icon" sizes="180x180" href="{{ pathto('_static/apple-touch-icon.png', 1) }}">
<link rel="icon" type="image/png" sizes="32x32" href="{{ pathto('_static/favicon-32x32.png', 1) }}">
<link rel="icon" type="image/png" sizes="16x16" href="{{ pathto('_static/favicon-16x16.png', 1) }}">
<link rel="shortcut icon" href="{{ pathto('_static/favicon.ico', 1) }}">
<link rel="manifest" href="{{ pathto('_static/site.webmanifest', 1) }}">
{# Extend the base theme layout #}
{% extends "!layout.html" %}
{# Add custom favicon links to the head section #}
{% block extrahead %}
{{ super() }}
{% include "favicon.html" %}
{% endblock %}
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
(crude-data)=
# Crude Datasets and Auxiliary Data
As explained in [](sample-loading), the user has several options for how energon converts the raw (crude) data inside the tar files into Python sample objects (instances of a `Sample` dataclass) during loading.
When using crude datasets, this conversion happens through so-called "cookers", i.e. user-defined functions registered in the task encoder, as explained below.
In this case, the dataset on disk will specify neither the resulting sample type nor a sample loader for conversion, hence we call it "crude".
All of the conversion will happen in the user's code base.
## Setting Up a Crude Dataset with Cookers
Let's try it.
When you run `energon prepare` to prepare your dataset, you can pick "Crude sample" as the sample type.
If you already have an existing energon-compliant dataset, you can modify it as follows (or create a copy of your `dataset.yaml` specifically for this use case and keep the original).
Original `dataset.yaml`:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: OCRSample
sample_loader: sample_loader.py:sample_loader
part_filter: sample_loader.py:part_filter
```
New `dataset.yaml` (or `crude.yaml` if you like):
```yaml
__module__: megatron.energon
__class__: CrudeWebdataset
subflavors:
  example_crude_selector: foobar
```
As shown above, [subflavors](sect-subflavors) can be assigned directly in the dataset yaml, but we recommend defining them in the [Metadataset](../basic/metadataset) instead.
Subflavors can be used to differentiate the data origin when you mix different crude datasets or crude with non-crude data using a [Metadataset](../basic/metadataset).
You can then use the subflavors to determine how each sample shall be processed, as explained below.
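For instance, a metadataset that mixes two crude datasets and tags each with its own subflavor could look roughly like the sketch below (the paths and the second subflavor value are made up for illustration):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        path: ./my_crude_ds_a
        subflavors:
          example_crude_selector: foobar
      - weight: 1
        path: ./my_crude_ds_b
        subflavors:
          example_crude_selector: bazqux
```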
Let's see how we set up our [Task Encoder](../basic/task_encoder) to cook the crude samples and turn them into real samples like {py:class}`TextSample <megatron.energon.TextSample>`.
A crude sample is essentially a dictionary ({py:class}`CrudeSample <megatron.energon.CrudeSample>` inherits from `dict`) and the loader will look for a {py:class}`Cooker <megatron.energon.Cooker>` that can handle and convert your crude samples.
You can provide cookers as follows:
```python
from megatron.energon import Cooker, basic_sample_keys

# ...

# We recommend placing your cooker functions in a separate file (esp. if they are larger)
def cook_text(sample: dict) -> TextSample:
    return TextSample(
        **basic_sample_keys(sample),
        text=f">{sample['txt'].decode()}<",
    )


class MyTaskEncoder(DefaultTaskEncoder[TextSample, TextSample, TextRawBatch, TextBatch]):
    cookers = [
        Cooker(cook_text, has_subflavors={"example_crude_selector": "foobar"}),
        Cooker(...),  # other cookers for other crude data if needed
    ]
    # ...
```
In the example above, the cooker acts on all crude samples that have a subflavor `example_crude_selector` set to `foobar`.
If you leave out the `has_subflavors` argument, the cooker will apply to any sample.
The cooker will convert the dictionary to a {py:class}`TextSample <megatron.energon.TextSample>` by decoding the raw bytes and decorating the text with some nice angle brackets.
You probably noticed the {py:meth}`basic_sample_keys <megatron.energon.task_encoder.cooking.basic_sample_keys>` helper that we inserted.
All it does is forward the key, restore key, and subflavors from the dict to the real sample. You will always need to forward these, or your dataset will not be restorable.
In a real use case you will want to do a lot more here, and we recommend keeping the cook methods in separate files and importing them where you define your TaskEncoder.
(aux-data)=
## Auxiliary Data for Polylithic Datasets
Using a crude dataset allows you to benefit from two other features of energon:
* Auxiliary Data
* Cache Pools
Both are often used in combination; a typical use case is online packing.
An **auxiliary data source** is an additional data source that supports random access and can be used to load data on-demand using its filename.
It is typically used with polylithic datasets where you have one primary dataset that contains only the text-based sample data
and one or more additional auxiliary data sources that contain the (larger) media data such as images or videos.
An auxiliary data source can be either
* Another energon-prepared WebDataset
* A folder on the local or a remote file system
You can specify it in your [metadataset](../basic/metadataset) yaml as follows (note the `aux:` section):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./my_primary_ds
    aux:
      foo_bar_source: ./aux_ds123
      fs_source: filesystem://./images
      fs_source_abs: filesystem:///absolute/path/to/images
      remote_source: msc://mybucket/path/ds
    subflavors:
      crude_type: my_dual_aux_example
```
The general format is:
```yaml
aux:
  NAME: PATH_OR_URL
  NAME: PATH_OR_URL
  ...
```
You can specify multiple aux sources, each of which can be one of:
* Relative or absolute path to a local prepared energon dataset
* Relative or absolute path to a local folder (use the prefix `filesystem://`)
* Path to a remote prepared energon dataset (use prefix `msc://`)
* *[Planned future feature]*: Path to a remote folder (use prefix `filesystem+msc://`)
In your code, the cooker will automatically receive a {py:class}`FileStore <megatron.energon.FileStore>` reference to the data source as a keyword argument:
```python
from megatron.energon import FileStore

# ...

def cook_text(sample: dict, foo_bar_source: FileStore) -> TextSample:
    additional_text = foo_bar_source.get(sample['add_txt_fname'])
    return TextSample(
        **basic_sample_keys(sample),
        text=f"{sample['txt'].decode()} + {additional_text.decode()}",
    )

# ...
```
You can use multiple sources. You'll have to specify a cooker argument for each source that was defined in the metadataset.
For easier debugging, you should always keep track of all the sources you used. The `get` method takes care of this if you pass it the sample like this:
```python
additional_text = foo_bar_source.get(sample['add_txt_fname'], sample)
```
This will update the sample-internal `__sources__` list with the aux dataset you used.
You can even use your primary dataset as an auxiliary data source and look up files by name. To do that, specify it in the cooker decorator and add an argument:
```python
from megatron.energon import cooker, FileStore

# ...

@cooker(need_primary=True)
def cook_text(sample: dict, primary: FileStore, foo_bar_source: FileStore) -> TextSample:
    # ...
```
You can then retrieve files by their names from the primary dataset.
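For illustration, a minimal sketch of such a cooker follows; the `sibling_txt_fname` key is made up, while `add_txt_fname` is reused from the example above:
```python
from megatron.energon import FileStore, TextSample, basic_sample_keys, cooker

@cooker(need_primary=True)
def cook_text_with_primary(sample: dict, primary: FileStore, foo_bar_source: FileStore) -> TextSample:
    # Load one file from the aux source and another from the primary dataset itself,
    # tracking both in the sample's __sources__ list by passing the sample
    aux_text = foo_bar_source.get(sample['add_txt_fname'], sample)
    sibling_text = primary.get(sample['sibling_txt_fname'], sample)
    return TextSample(
        **basic_sample_keys(sample),
        text=f"{aux_text.decode()} | {sibling_text.decode()}",
    )
```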
(cache-pools)=
## Cache Pools
Cache pools allow the user to defer the data transfer if the content will be needed at some point in the future but not immediately.
This is only needed if the media data is rather large and does not reside on a local disk, but on a network file system (e.g. Lustre) or a remote file system (e.g. object storage).
Cache pools are especially beneficial if you are using buffers in your pipeline, like a shuffle buffer or a packing buffer. For example, when using [online packing](../advanced/packing), we may need to keep a buffer of several thousand samples to optimize for the best packing.
We cannot keep several thousand images in memory, but we also don't need the actual image content to optimize the packing.
Hence we will use auxiliary datasets as explained above.
However, at the time of filling the buffer, **we already know** that we **will need the image content in the future**, so cache pools can be used to **prefetch it in the background**.
Initially we want to load some information about the sample and its image but not the actual image pixels.
Later, when the packing is computed, we need to retrieve the pixel values.
In practice, this means the cooker will use a cache pool to queue the data retrieval from an auxiliary data source and obtain a lazy object (a handle to this future data). In a later stage (like {py:meth}`pack_selected_samples <megatron.energon.TaskEncoder.pack_selected_samples>`), the lazy object can be used to retrieve the content.
Ideally, in the meantime, the cache pool has already downloaded the data to a local SSD.
### Using a Cache Pool
When calling {py:func}`get_savable_loader <megatron.energon.get_savable_loader>`,
we pass a cache pool as an additional argument:
```python
from megatron.energon import FileStoreCachePool

# ...

loader = get_savable_loader(
    my_ds,
    ...,
    cache_pool=FileStoreCachePool(
        parent_cache_dir="/local_scratch/cache",
        num_workers=1,
    ),
)
```
Then we tell the cooker decorator that we need access to the cache pool and use it to get a lazy object:
```python
import torch

from megatron.energon import CachePool, FileStore, Lazy, Sample, basic_sample_keys, cooker, edataclass
from megatron.energon.av import AVDecoder

# ...

@edataclass
class TextVideoSample(Sample):
    text: str
    video: Lazy[AVDecoder]


@edataclass
class PackedTextVideoSample(Sample):
    text: str
    video: torch.Tensor


@cooker(need_cache=True)
def cook_video(sample: dict, video_source: FileStore, cache: CachePool) -> TextVideoSample:
    # Previous non-cached version:
    # video = video_source.get(sample['video_path'])
    # Cached version: only queue the retrieval and obtain a lazy handle
    video = cache.get_lazy(video_source, sample['video_path'])
    return TextVideoSample(
        **basic_sample_keys(sample),
        text=sample['txt'].decode(),
        video=video,  # Pass the lazy object on
    )
```
Later down the data processing pipeline, we can retrieve the data, for example here:
```python
@stateless
def pack_selected_samples(self, samples: List[TextVideoSample]) -> PackedTextVideoSample:
    # Get the real object now:
    video_data: AVDecoder = samples[0].video.get(samples[0])
    return PackedTextVideoSample.derive_from(
        samples[0],
        text=samples[0].text,
        video=video_data.get_video_clips([(0, 1), (19, 20)])[0],
    )
```
There is a second option, e.g. if you want to combine a monolithic dataset with packing and caching: Use `cache.to_cache()` to move already loaded data to the cache:
```python
@cooker(need_cache=True)
def cook_video_monolithic(sample: dict, cache: CachePool) -> TextVideoSample:
    # Previous non-cached version:
    # video: AVDecoder = sample['mp4']
    # Move the video to the cache, retrieve it later when it is needed again.
    video: Lazy[AVDecoder] = cache.to_cache(
        sample['mp4'],
        sample['__key__'] + ".mp4",  # Just a name for debugging
    )
    return TextVideoSample(
        **basic_sample_keys(sample),
        text=sample['txt'].decode(),
        video=video,  # Pass the lazy object on
    )
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Customized Blending
In your Task Encoder, you can customize how datasets are blended by overriding the `build_train_datasets` method as shown below.
```{warning}
This interface is not stable and may change frequently as we add new features. If you change
how the datasets are plugged together, be prepared to adapt your code to future changes.
```
```py
class CaptioningTaskEncoder(
    DefaultTaskEncoder[CaptioningSample, CaptioningSample, CaptioningRawBatch, CaptioningBatch]
):
    ...

    def build_train_datasets(
        self,
        *,
        datasets: List[Tuple[BaseCoreDatasetFactory[T_sample], float]],
        worker_config: WorkerConfig,
        batch_size: Optional[int],
        batch_drop_last: bool = False,
        packing_buffer_size: Optional[int] = None,
        virtual_epoch_length: int = 0,
        shuffle_buffer_size: Optional[int] = None,
    ) -> SavableDataset[T_batch]:
        # The default implementation uses BlendDataset, which mixes the datasets according to their weights.
        # This could be customized, e.g. to batch the datasets first (i.e. each batch only contains data
        # from a single dataset) and then blend, which would yield the same distribution.
        dataset = BlendDataset(
            *datasets,
            worker_config=worker_config,
        )

        # Build batches from blended samples
        dataset = self.build_batch(
            dataset,
            batch_size=batch_size,
            batch_drop_last=batch_drop_last,
            worker_config=worker_config,
        )

        # Optionally epochize
        if virtual_epoch_length > 0:
            dataset = EpochizeDataset(
                dataset,
                length=virtual_epoch_length,
                worker_config=worker_config,
            )

        return dataset
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
(custom-sample-loader)=
# Custom Sample Loader
```{warning}
The custom sample loader is a legacy feature and using [crude datasets](../advanced/crude_datasets.md) with cookers is usually the preferred way.
This feature might be deprecated at some point in the future.
```
Instead of using a `field_map` in your `dataset.yaml`, you can create custom python code for sample loading
right next to your dataset inside the `.nv-meta` folder.
One reason for the planned deprecation is that you cannot easily version-control the code inside this folder.
In contrast, cookers live inside your code repository together with the task encoder.
Here's an example for your updated `dataset.yaml` if you want to use a sample loader:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: OCRSample
sample_loader: sample_loader.py:sample_loader
part_filter: sample_loader.py:part_filter
```
In addition, you need to create a python file inside the `.nv-meta` folder. In this case it's
called `sample_loader.py`.
That file needs to contain the two methods referenced above:
```python
import torch


def sample_loader(raw: dict) -> dict:
    data = raw["bbox.json"]
    return dict(
        __key__=raw["__key__"],
        image=raw["jpg"],
        text=raw["gt.txt"],
        lines_boxes=torch.tensor([box["bbox"] for box in data], dtype=torch.int64),
        lines_text=[box["text"] for box in data],
    )


def part_filter(part: str) -> bool:
    return part in ("bbox.json", "gt.txt", "jpg")
```
Your `sample_loader` method must accept a dictionary as an argument and return a dictionary. It directly operates on
the webdataset samples, and the resulting dictionary keys should map to the corresponding sample class members,
in this case those of an {py:class}`OCRSample <megatron.energon.OCRSample>`.
With the optional `part_filter` method, you can prevent some webdataset fields from being loaded.
Given a field name, the method should return True if the field is to be kept.
(interleaved-sample-loader)=
## Example: Interleaved Data and Arbitrary Image Count
### The Webdataset Structure
If you need multiple files with an arbitrary number of items per sample, e.g. multiple image / video / audio files, this section shows a blueprint for how to set up your webdataset tar files and how to load that webdataset with Energon.
The structure of the shard files could be like this:
`tar -tvf shard_0.tar`:
```
sample_000001.2345ew.jpg
sample_000001.json
sample_000002.35tags.jpg
sample_000002.as23ds.jpg
sample_000002.gd1dtg.jpg
sample_000002.gds233.jpg
sample_000002.json
sample_000002.sdag42.jpg
sample_000003.json
sample_000004.asf234.jpg
sample_000004.json
```
where the structure of a json file is:
`tar -xf shard_0.tar sample_000001.json -O`:
```json
{
  "images": [null, "2345ew.jpg", null],
  "texts": ["This is some text, an image is following.", null, "More text after the image."]
}
```
Note that the image path corresponds to the part of the image filename after the first "." in the sample; this is all part of the extension as defined by webdataset. Everything before the first "." is the sample key and must be identical for all files that belong to the same sample (group).
### Usage with Energon
To make this work with Energon, in the `energon prepare` [CLI preparation tool](energon-prepare), you can either tell the wizard to create a custom sample loader template for you, or change the files accordingly. Here is the example with the structure above:
`.nv-meta/dataset.yaml`:
```yaml
sample_type:
  __module__: megatron.energon
  __class__: InterleavedSample
part_filter: sample_loader.py:part_filter
sample_loader: sample_loader.py:sample_loader
```
`.nv-meta/sample_loader.py`:
```python
import torch


def sample_loader(raw: dict) -> dict:
    # Note that the images are already decoded, as well as the json part.
    return dict(
        __key__=raw["__key__"],
        sequence=[
            (raw[image] if text is None else text)
            for image, text in zip(raw["json"]["images"], raw["json"]["texts"])
        ],
    )


def part_filter(part: str) -> bool:
    # Need to load all parts
    return True
```
When iterating, you'll get these samples as `InterleavedSample`, whose sequence contains image tensors and text strings. The first sample would look like this:
```python
InterleavedSample(
    sequence=[
        "This is some text, an image is following.",
        torch.Tensor(...),  # or a PIL.Image.Image
        "More text after the image.",
    ]
)
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Epochized Blending
As an alternative to blending with a weight for each dataset, this concept makes blending exact and
lets iteration follow epochs (i.e. iteration can be interrupted after each epoch).
Here is an example `metadataset.yaml` config file that changes to epochized blending:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # Blend the following datasets, repeating coco 5 times, coyo-train 2 times and coyo-val once
    blend_epochized:
      - repetitions: 5
        path: ./coco
        # ... Other parameters
      - repetitions: 2
        path: ./coyo
      - repetitions: 1
        path: ./coyo
        split_part: val
```
Now, the call to `get_train_dataset` requires the additional parameter `repeat=False` to interrupt iterating after one epoch:
```py
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'metadataset.yaml',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    repeat=False,
))

# This will now stop iterating after the datasets have been iterated (coco 5 times, coyo-train 2
# times and coyo-val once). Of course, the data is still being shuffled between all those
# datasets.
for batch in loader:
    print(batch)

# This will iterate the second epoch
for batch in loader:
    print(batch)
```
If the metadataset is used with `get_val_dataset`, the `repetitions` are ignored.
The metadataset would also work without setting `repeat=False`, but then the shuffle buffer would shuffle samples across epoch boundaries.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Grouping
Grouping allows for rule-based batching of samples into one batch on the fly.
Note how this is different from [packing](packing) which joins multiple samples into one (and is done before batching).
On the other hand, grouping is an alternative to standard batching.
## Example use cases
* Select samples to batch based on image resolution, so that only samples of the same size are in one batch
* Select blended samples based on their dataset origin, so that one batch does not mix different tasks or data types
## How to group
To use grouping, you need to define the method {py:meth}`batch_group_criterion <megatron.energon.TaskEncoder.batch_group_criterion>` in your custom task encoder.
This method receives a sample and returns a hashable value that is used to group the samples,
along with the batch size for that group.
Samples with the same batch group criterion will be batched together. Once enough samples for one group
have been collected (reached the batch size for that group), they will be batched and pushed down the pipeline
to the next processing step.
Here's an example task encoder that batches samples based on their image aspect ratios:
```python
class GroupingTaskEncoder(DefaultTaskEncoder):
    def batch_group_criterion(self, sample: CaptioningSample) -> Tuple[Hashable, Optional[int]]:
        aspect_ratio = sample.image.shape[2] / sample.image.shape[1]
        # Bin aspect ratios into 3 groups
        if aspect_ratio < 0.8:
            return "portrait", 8
        elif aspect_ratio < 1.2:
            return "square", 8
        else:
            return "landscape", 8
```
In the example, the aspect ratio is sorted into one of three bins and a string is used as the grouping key.
The batch size used here is always 8.
Here is another example where each batch contains only images with the exact same size.
Note how the image shape itself is used as the grouping key.
```python
class GroupingTaskEncoder(DefaultTaskEncoder):
    def batch_group_criterion(self, sample: CaptioningSample) -> Tuple[Hashable, Optional[int]]:
        batch_size = 4 if sample.image.shape[1] < 512 else 2
        return sample.image.shape, batch_size
```
For images with a height of less than 512 pixels, the batch size will be 4, for larger images it's reduced to 2.
## Fixed global batch size
Instead of specifying the batch size for each group individually, you can also specify the batch size as usual when calling
`get_train_dataset`. The `batch_group_criterion` method should then return `None` for the batch size.
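A minimal sketch of this variant, reusing the `CaptioningSample` from above (the grouping rule itself is arbitrary):
```python
from typing import Hashable, Optional, Tuple

from megatron.energon import CaptioningSample, DefaultTaskEncoder


class GroupingTaskEncoder(DefaultTaskEncoder):
    def batch_group_criterion(self, sample: CaptioningSample) -> Tuple[Hashable, Optional[int]]:
        # Still group by orientation, but return None so that the batch_size passed to
        # get_train_dataset decides how many samples make up a batch
        aspect_ratio = sample.image.shape[2] / sample.image.shape[1]
        return ("landscape" if aspect_ratio >= 1.0 else "portrait"), None
```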
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Joining Datasets
Joining is needed when you have two or more corresponding datasets, each of which contributes some parts of every sample.
```{image} ../images/joining.png
:alt: Joining Datasets
:class: bg-primary
:width: 550px
:align: center
```
Dataset joining is analogous to webdataset's "add column", i.e., merging two datasets whose tar files contain corresponding entries for the same samples.
## Example File Structure
Here is an example structure of the datasets within the tar files:
```
ds1
├── .nv-meta
│ ├── .info.json
│ ├── split.yaml
│ └── dataset.yaml
├── shard1.tar
│ ├── 0001.jpg
│ ├── 0002.jpg
│ └── 0003.jpg
├── shard1.idx
ds2
├── .nv-meta
│ ├── .info.json
│ ├── split.yaml
│ └── dataset.yaml
├── shard1.tar
│ ├── 0001.txt
│ ├── 0002.txt
│ └── 0003.txt
├── shard1.idx
```
In this example, `ds1/shard1.tar` and `ds2/shard1.tar` contain files with the same base names / keys.
When joining, the files from both datasets are joined based on their keys, which must be unique within each dataset across all the shards. Each dataset must be [prepared](../basic/data_prep) (i.e. .nv-meta created) with a recent version of energon.
## Minimal Example
Below is a minimal meta dataset example for joining.
`metadataset_minimal.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    join:
      - path: ds1
      - path: ds2
    joiner:
      __module__: megatron.energon
      __class__: CaptioningSample
```
Note that the blending is omitted here, as it is optional. Joining is of course also optional.
```{warning}
All metadatasets that contain a `join` must be prepared with the `energon prepare` command.
This will compute the join index and store it next to the metadataset in a folder with a similar name.
```
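For example, for the minimal metadataset above this would be (assuming the file lives in the current directory):
```shell
energon prepare ./metadataset_minimal.yaml
```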
## Join Modes
When joining two datasets, it may happen that the first dataset (primary dataset) has more samples or fewer samples than the secondary dataset(s).
In this case, we have to decide how to handle these samples that do not match.
The primary dataset always serves as the reference and there will never be more samples in the join result than in the primary dataset. However if a primary sample has no match in a secondary dataset, it may be skipped as explained below.
For each of the secondary datasets, the user can specify a `nonmatch` setting.
It determines what happens if a sample from the primary dataset is not found in the given secondary dataset:
* `error` (default): An error is raised
* `skip`: The whole sample is skipped
* `none`: The column for the current secondary dataset is filled with `None` if there's no match
Example `metadataset_nomatch.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    join:
      - path: ds1
      - path: ds2
        nonmatch: skip
      - path: ds3
        nonmatch: none
    joiner:
      __module__: megatron.energon
      __class__: CaptioningSample
```
To illustrate the effect, let's look at some example data:
* `ds1` samples: `s1`, `s2`, `s3`, `s5`, `s6`
* `ds2` samples: `s1`, `s3`, `s4`, `s6`, `s7`
* `ds3` samples: `s1`, `s2`, `s3`, `s100`
The resulting joined data would contain the following samples, one in each row:
| ds1 | ds2 | ds3 |
| --- | --- | ---- |
| s1 | s1 | s1 |
| s3 | s3 | s3 |
| s6 | s6 | None |
Explanation:
* The sample key `s1` is available in all datasets.
* `s2` is missing from `ds2` and nonmatch is set to `skip`, so the sample will not appear in the result.
* `s3` is available in all datasets.
* `s4` is not in the primary dataset. Only samples from the primary dataset will be included.
* `s5` is again missing from `ds2`, so it is skipped; this time it is also missing from `ds3`.
* `s6` is missing from `ds3` and `ds3` has `nonmatch` set to `none`, so the sample is not skipped, but the column for `ds3` is set to `None`
## Extensive Example
Here is a more extensive example that shows multiple things at once:
* Joining can be used inside blending
* The datasets to be joined can have custom subflavors or dataset yamls specified
* A custom "joiner" can be specified to define how samples are joined and what the resulting type is
* The `nonmatch` setting is not included here, but would work just like shown above
`metadataset_extended.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    blend:
      - weight: 1
        join:
          - path: ds1
            dataset_config: dataset.yaml  # If override is needed
          - path: ds2
            dataset_config: dataset.yaml
            subflavors:  # If needed, will be merged (overriding) with parent subflavors
              ds2_extra: 2
            split_config: split.yaml
        joiner:
          __module__: my_module
          __class__: JoinedSample  # Type should implement from_joined(ds1, ds2)
        split_config: split.yaml  # Sets this for all joined datasets
        split_part: train  # Sets this for all joined datasets
        subflavors:  # Sets this for all joined datasets (it will be merged with their individual subflavors)
          source: metadataset.yaml
          src: ds1
```
## Custom Join Type
To define a custom join type, you can create a Python class as shown below in `my_module.py`:
```python
from dataclasses import dataclass

import torch

from megatron.energon import Sample, TextSample


@dataclass
class JoinedSample(Sample):
    text1: torch.Tensor
    text2: torch.Tensor

    @staticmethod
    def from_joined(ds1: TextSample, ds2: TextSample) -> "JoinedSample":
        return JoinedSample.derive_from(
            ds1,
            text1=ds1.text,
            text2=ds2.text,
        )
```
This class should implement the `from_joined` method to combine samples from `ds1` and `ds2`.
Note: It is important to use `derive_from` with the first argument being the first sample, as this will guarantee that the state can be saved and restored. It ensures that all the internal keys of the sample are retained.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Packing
Packing (sometimes also called sequence packing) enables you to selectively combine multiple
input samples into a single sample, for example depending on their length.
This technique is commonly used with large language models when the input samples have very different
lengths, which would otherwise lead to lots of padding and hence wasted compute.
This section explains how you can pack samples together and utilize the full context length.
## How to pack samples on the fly
To use packing, you need to implement the TaskEncoder methods {py:meth}`select_samples_to_pack <megatron.energon.TaskEncoder.select_samples_to_pack>`
and {py:meth}`pack_selected_samples <megatron.energon.TaskEncoder.pack_selected_samples>`.
Furthermore, you need to initialize the loader with the `packing_buffer_size` argument set to a non-zero number.
The `select_samples_to_pack` method will receive a list of samples (size according to the selected `packing_buffer_size`),
and should partition those samples into groups that shall be packed together. Hence the function returns
a list of lists of samples.
For each group, the second method `pack_selected_samples` will be called. You need to implement how a group of
samples will be mapped to a single sample. In terms of LLMs for example, this method might concatenate the input tokens.
```{admonition} Note
:class: important
You can set the `__restore_key__` of the packed sample to an empty tuple, since energon will set the correct
restore key afterwards, based on the samples that went in.
```
```{warning}
To handle attention masks and tokenized inputs, you will want to operate on a different sample type.
The `pack_selected_samples` method may return a different sample type that is expected as the input for the `batch` method.
```
It is important to mark custom functions like `encode_sample` and `pack_selected_samples` as `@stateless` to allow saving
samples for packing. If augmentations happen, the function should be marked with
`@stateless(restore_seeds=True)` to deterministically set the seeds based on the `TaskEncoder.current_sample_index`.
You have to make sure the methods are actually stateless, meaning that they produce the same output when invoked
with the same input and random states.
Example packing for a large language model extending the example from the [](../basic/task_encoder) section:
```python
class PackingCaptioningTaskEncoder(CaptioningTaskEncoder):
    """This class extends the CaptioningTaskEncoder and adds select_samples_to_pack and
    pack_selected_samples for packing samples efficiently on-the-fly.

    Set the `packing_buffer_size` of get_(train|val)_dataset to an accordingly large number to get a
    properly sized input sample buffer with good diversity.
    """

    @stateless(restore_seeds=True)
    def encode_sample(self, ...):
        # Added `stateless` decorator to allow saving samples for packing. Will set the seed
        # deterministically based on the self.current_sample_index.
        ...

    def select_samples_to_pack(self, samples: List[CaptioningSample]) -> List[List[CaptioningSample]]:
        # Do something intelligent here, e.g. sort by caption length and concat where possible.
        # This could be better, but it's just an example.
        samples.sort(key=lambda x: len(x.caption))
        groups = []
        while len(samples) > 0:
            # Always take at least one sample per group to avoid an endless loop for overly long captions
            batch = [samples.pop(0)]
            caption_len = len(batch[0].caption)
            while len(samples) > 0 and caption_len + len(samples[0].caption) < self.max_length:
                sample = samples.pop(0)
                batch.append(sample)
                caption_len += len(sample.caption)
            groups.append(batch)
        return groups

    @stateless
    def pack_selected_samples(self, samples: List[CaptioningSample]) -> CaptioningSample:
        # Construct a new CaptioningSample by concatenating the captions
        ...
```
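To actually enable packing, pass a non-zero `packing_buffer_size` when creating the loader. A rough sketch along these lines, assuming the task encoder is passed via the `task_encoder` argument and using placeholder values for the dataset path and buffer size:
```python
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

loader = get_loader(get_train_dataset(
    'metadataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    packing_buffer_size=1000,
    task_encoder=PackingCaptioningTaskEncoder(),
    worker_config=WorkerConfig.default_worker_config(),
))
```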
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Parallelism
Neural network parallelism can be categorized into several types:
1. **Data Parallelism** (DP): This involves splitting the data across multiple processors and performing the same operation on each subset of the data. It is commonly used to increase the global batch size.
2. **Model Parallelism**: In this approach, different parts of the model are distributed across multiple processors. This is useful when the model itself is too large to fit into the memory of a single processor.
3. **Pipeline Parallelism** (PP): This technique involves breaking down the model into different stages and processing different mini-batches of data through these stages in a pipeline fashion. It helps in improving the utilization of resources and reducing idle time.
4. **Tensor Parallelism** (TP): This method splits individual tensors (weights and activations) across multiple devices. It is particularly effective for very large models where even a single layer cannot fit into the memory of one device.
These parallelisms have different consequences for the dataloader:
- **Data Parallelism** (DP): The dataloader needs to ensure that each processor gets a different subset of the data. This is supported by Energon. The data parallel groups should be specified in the worker config.
- **Pipeline Parallelism** (PP): Data is typically only loaded on the first pipeline parallel rank and propagates through the other ranks within the pipeline parallel group. This means you only instantiate an Energon dataset and loader on the first ranks of those groups.
- **Tensor Parallelism** (TP): The dataloader loads the same input data on multiple devices. Typically, this is ensured either by instantiating the dataloader in exactly the same way on the same data parallel rank across the different data parallel groups, or, e.g., by loading the data only once and distributing it using torch distributed.
## Example
Example with the following ranks and worker configuration (Data Parallel = 2, Pipeline Parallel = 2, Tensor Parallel = 2):
* `Global Rank 0`: `DP Rank = 0` (DP group A), `PP Rank = 0`, `TP Rank = 0`
* `Global Rank 1`: `DP Rank = 0` (DP group B), `PP Rank = 0`, `TP Rank = 1`
* `Global Rank 2`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 0`
* `Global Rank 3`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 1`
* `Global Rank 4`: `DP Rank = 1` (DP group A), `PP Rank = 0`, `TP Rank = 0`
* `Global Rank 5`: `DP Rank = 1` (DP group B), `PP Rank = 0`, `TP Rank = 1`
* `Global Rank 6`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 0`
* `Global Rank 7`: `DP Rank = X` (No DP group), `PP Rank = 1`, `TP Rank = 1`
When saving the state of the data loader, we only need to store the states
of global ranks 0 and 4, i.e. the first DP group "A".
Ranks 1 and 5 will have the same state as they are duplicates.
When restoring the state, global ranks 0, 1, 4, 5 need to receive a state.
There are different ways to achieve this. The following example illustrates how the state
can be saved and restored in a distributed setting.
```py
import torch

from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig

# Initialize the process group
torch.distributed.init_process_group(backend='nccl')

# Get the DP, PP, TP ranks
global_rank = torch.distributed.get_rank()
data_parallel_rank = [0, 0, None, None, 1, 1, None, None][global_rank]
pipeline_parallel_rank = [0, 0, 1, 1, 0, 0, 1, 1][global_rank]
tensor_parallel_rank = [0, 1, 0, 1, 0, 1, 0, 1][global_rank]

if global_rank in (0, 4):
    # DP Group A
    # If on rank 0 or 4, the DP group consists of those ranks (each representing DP ranks 0 and 1).
    data_parallel_group = torch.distributed.new_group(ranks=[0, 4])
elif global_rank in (1, 5):
    # DP Group B
    # If on rank 1 or 5, the DP group consists of those ranks (each representing DP ranks 0 and 1).
    data_parallel_group = torch.distributed.new_group(ranks=[1, 5])
else:
    data_parallel_group = None

if data_parallel_rank is not None:
    assert pipeline_parallel_rank == 0, "Only Pipeline Parallel ranks 0 load data"

    # Set the worker config correspondingly
    worker_config = WorkerConfig(
        rank=data_parallel_rank,
        world_size=torch.distributed.get_world_size(data_parallel_group),
        num_workers=3,
        data_parallel_group=data_parallel_group,
    )

    # Create the loader with that config
    loader = get_savable_loader(get_train_dataset(
        'coyo-coco-dataset.yaml',
        batch_size=4,
        shuffle_buffer_size=100,
        max_samples_per_sequence=100,
        worker_config=worker_config,
    ))

    # Iterate the data
    for i, batch in zip(range(10), loader):
        # Do forward-backward pass
        print(batch)
        break

    if tensor_parallel_rank == 0:
        # Save the state only for the first TP rank (the other TP ranks have a copy of that state)
        state = loader.save_state_rank()
        # E.g. save to disk with torch
        torch.save(state, f"dataloader_rank{data_parallel_rank}.pt")

        # Alternatively, save once for the whole dp group:
        # state = loader.save_state_global(global_dst_rank=0)
        # if state is not None:
        #     torch.save(state, "dataloader.pt")

# ... when loading:

if data_parallel_rank is not None:
    assert pipeline_parallel_rank == 0, "Only Pipeline Parallel ranks 0 load data"

    # Restore the state for a new loader
    loader = get_savable_loader(get_train_dataset(
        'coyo-coco-dataset.yaml',
        batch_size=4,
        shuffle_buffer_size=100,
        max_samples_per_sequence=100,
        worker_config=worker_config,
    ))

    # E.g. load from disk as saved above
    state = torch.load(f"dataloader_rank{data_parallel_rank}.pt")
    # Restore the state
    loader.restore_state_rank(state)

    # Alternatively, when using a global checkpoint,
    # load the checkpoint from disk on every dp rank:
    # state = torch.load("dataloader.pt")
    # loader.restore_state_global(state)

    # Or load only once from disk for each dp group:
    # if data_parallel_rank == 0:
    #     state = torch.load("dataloader.pt")
    # else:
    #     state = None
    # loader.restore_state_global(state, src_rank=0)
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Remote Dataset
Megatron Energon supports the use of remote datasets. For versions newer than 5.2.0, Energon file access is based on [Multi Storage Client (MSC)](https://github.com/NVIDIA/multi-storage-client).
This means you can train or validate with your data right from any storage by simply swapping the dataset path for a so-called _MSC URL_.
## Prerequisites
To use a remote dataset, install energon with one or more of these extras:
* `s3`
* `aistore`
* `azure-blob-storage`
* `google-cloud-storage`
* `oci`
like this:
```sh
pip install megatron-energon[s3,oci]
```
Set up the msc config as described in [Multi Storage Client documentation](https://nvidia.github.io/multi-storage-client/).
You can also use the rclone config with msc, as was described prior to 5.2.0.
For fast data loading, we recommend activating MSC local caching:
```yaml
cache:
  size: 500G
  use_etag: true
  eviction_policy:
    policy: "fifo"
    refresh_interval: 3600
  cache_backend:
    cache_path: /tmp/msc_cache # prefer to use local NVMe, but a Lustre path also works
```
And point MSC to the config with
```sh
export MSC_CONFIG=/path/to/msc_config.yaml
```
## The URL syntax
The syntax is as simple as
```
msc://CONFIG_NAME/PATH
```
For example:
```
msc://coolstore/mainbucket/datasets/somedata
```
You can use this URL instead of paths to datasets in
* Functions like `get_train_dataset`, `get_val_dataset`
* Inside [metadataset](../basic/metadataset) specifications
* As arguments to `energon prepare` or `energon lint`. Note that those may be slow for remote locations.
* Or as a path to [`energon mount`](energon-mount) to locally inspect your remote dataset 😎
Example usage:
```python
ds = get_train_dataset(
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Reproducible Scaling
A special use case is to re-run or continue a training run with the exact same data order, but using a different number of nodes or ranks.
Since version 2.0.0, Megatron Energon supports this behavior if a few constraints are met:
* The energon major version must be the same across runs
* The global batch size must stay the same across runs
* The global batch size must be a multiple of `micro-batch size * world_size * num_workers`
  * The multiple corresponds to the number of gradient accumulation steps in your training
* The product `world_size * num_workers` must stay the same across runs, such that the global number of workers stays the same
* When using random seed offsets in your {py:class}`WorkerConfig <megatron.energon.WorkerConfig>`, those need to be the same
By obeying these rules, you will be able to reproduce the same global batches. Let's look at an example.
| Name | Global batch size | Micro batch size | World size | Number of Workers | Gradient accumulation steps |
| ----- | ----------------- | ---------------- | ---------- | ----------------- | --------------------------- |
| Run 1 | 8 | 2 | 4 | 1 | 1 |
| Run 2 | 8 | 2 | 1 | 4 | 4 |
Iterating the dataset will yield the same global batches for both of these runs, if the seed is set correctly.
In practice, you will need to adapt your worker config accordingly.
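As a rough sketch of the two runs from the table (only `world_size` and `num_workers` matter for the rule above; rank 0 is shown, every rank passes its own rank index):
```python
from megatron.energon import WorkerConfig

# Run 1: 4 ranks with 1 worker each -> 4 global workers, micro batch size 2,
# 1 gradient accumulation step -> global batch size 8
worker_config_run1 = WorkerConfig(rank=0, world_size=4, num_workers=1)

# Run 2: a single rank with 4 workers -> still 4 global workers, micro batch size 2,
# 4 gradient accumulation steps -> global batch size 8
worker_config_run2 = WorkerConfig(rank=0, world_size=1, num_workers=4)
```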
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Dataset Subsets
Dataset subsets allow restricting a dataset (or parts of a metadataset hierarchy) to a specific portion of the available samples.
This is useful for rapid prototyping, ablation studies, different training stages, or constructing disjoint train/validation/test splits that differ from the original dataset configuration.
A subset is defined by a two-element `range` list consisting of `[start, end]` (where `start` is inclusive, `end` exclusive).
Each element can be either
* a **percentage** string (e.g. `"0%"`, `"12.5%"`, `"100%"`) – interpreted relative to each inner
dataset size, or
* an **absolute** integer – interpreted as a sample index. Absolute indices are only allowed for
*leaf* datasets (`path` to a prepared dataset containing `.nv-meta`).
## Basic example
The snippet below keeps the first 80 % of the *COYO* `train` split (as defined in the `split.yaml`) for training while
evaluating on the remaining 20 % of the `train` split. Note how the `subset` key is placed directly next to the corresponding `path`.
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [0%, 80%]}
  val:
    path: ./coyo
    split: train
    subset: {range: [80%, 100%]}
```
## Nested subsets and merging rules
Subsets can appear at any level that ultimately yields samples
(direct `path` reference to a prepared dataset containing `.nv-meta`, `join`, `blend`, `blend_epochized`).
When multiple subsets are nested, the *inner* subset is applied first, then the portion selected by the *outer* subset is applied *within* the already selected range.
For percentages the ranges are composed multiplicatively.
Example: the outer subset `[0%, 50%]` followed by an inner subset `[25%, 75%]` results in the final
range `[25%, 50%]` of the original dataset.
Absolute indices short-circuit merging: they can **only** be specified at the leaf level and must
not be combined with another absolute range farther up the hierarchy.
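The percentage composition example above would look roughly like this in YAML (the path is illustrative):
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # Outer subset: first 50% of whatever the inner level selects
    subset: {range: [0%, 50%]}
    blend:
      - weight: 1.0
        path: ./coyo
        # Inner subset: 25% to 75% of the dataset
        # -> effectively [25%, 50%] of the original dataset
        subset: {range: [25%, 75%]}
```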
## Absolute ranges
Absolute indices are handy when exact sample counts are required.
## Advanced examples
The following configuration combines absolute ranges with the nesting rules. The inner subset takes
the first **1000** samples of the *COCO* train split and mixes them with the full *COYO* train split using
weight-based blending. The outer subset then reduces the inner ranges to their first 50%, thus effectively taking
the first **500** samples of *COCO*, mixed with the first **50%** of the *COYO* dataset.
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    # The subset is applied to each blended dataset separately.
    # I.e. for the first, the sample range is [0, 500], for the second the range is [0%, 50%]
    subset: {range: [0%, 50%]}
    blend:
      - weight: 1.0
        path: ./coco
        subset:
          # Take exactly 1000 samples (indices 0-999)
          range: [0, 1000]
      - weight: 1.0
        path: ./coyo
```
Absolute ranges can also be specified to run up to the end of the dataset using the `end` keyword:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  train:
    path: ./coyo
    subset: {range: [1422, end]}
```
## Python usage
No API changes are required on the Python side – subsets are fully specified in the YAML. Simply
load the dataset with the regular helpers.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Command-Line Interface
After you have [installed](../intro/installation) energon, a script called `energon` will be available on your PATH.
It provides commands to prepare, preview, or lint datasets on disk.
Here's a simple example:
```shell
energon prepare /mnt/data/my_captioning_webdataset
```
The above command will scan your existing off-the-shelf [web dataset](https://webdataset.github.io/webdataset/)
and add the [needed metadata](data-on-disk) to make it compatible with Energon.
Below, you can see the available sub-commands under `energon`.
```{eval-rst}
.. click:: megatron.energon.cli.main:main
:prog: energon
:nested: short
```
(energon-prepare)=
## energon prepare
An interactive tool to generate metadata for your existing webdataset (or [jsonl dataset](create-jsonl-dataset)).
This will help make the dataset compliant with our [format](data-on-disk).
The tool will ask you for a train/val/test split and how to assign the webdataset fields to the
fields of the corresponding sample type in Energon.
See [Data Preparation](../basic/data_prep) for more details on how to use this command.
## energon info
Prints information about the dataset, such as the overall number of samples and its size.
It also prints the energon version that was used to prepare the dataset, if a recent version was used.
## energon lint
You can execute this tool on the prepared dataset to check if the data is valid and loadable.
It will report any problems such as non-readable images.
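For example, reusing the dataset path from the beginning of this page:
```shell
energon lint /mnt/data/my_captioning_webdataset
```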
(energon-mount)=
## energon mount
Use this to mount your [prepared dataset](../basic/data_prep) as a virtual read-only filesystem and inspect it using `ls` or other file browsing tools.
It is as simple as running
```shell
energon mount /PATH/TO/DATASET ./MY_MOUNT_POINT
```
This will leave the process in the foreground and the mount will exist as long as the program is running.
If you want to detach the process to the background, use the `-d` or `--detach` flag.
Two modes are supported by `energon mount`:
| | Flat mode (default) | Sample folder mode (flag `-s`) |
| --- | --- | --- |
| Description | All files from all shards listed at<br/>the root of the mount point. | One folder per sample key,<br/>each folder containing files<br/>named by the sample part extension |
| Example | `001.jpg`<br/>`001.txt`<br/>`002.jpg`<br/>`002.txt`<br/>`...` | `001/`<br/>`┣ jpg`<br/>`┗ txt`<br/>`002/`<br/>`┣ jpg`<br/>`┗ txt`<br/>`...` |
```{warning}
You should not use the same sample keys in multiple shards of the same dataset.
If you do, `energon mount` will not work as intended and it will display WARNING files in the virtual mount.
```
## energon preview
This command will load a dataset and display samples one-by-one on the console.
Note that this will not work for datasets with non-standard flavors or crude datasets.
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# Packages and Modules
```{toctree}
---
maxdepth: 2
---
modules_data
```
<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->
# megatron.energon
```{eval-rst}
.. automodule:: megatron.energon
:members:
:undoc-members:
:show-inheritance:
.. automodule:: megatron.energon.task_encoder.cooking
:members:
:undoc-members:
:show-inheritance:
```
# megatron.energon.av
```{eval-rst}
.. automodule:: megatron.energon.av
:members:
:undoc-members:
:show-inheritance:
```