_**DISCLAIMER**: This package contains research code. APIs may change._
# What is this?
**Megatron Energon** is the multi-modal data loader of [Megatron](https://github.com/NVIDIA/Megatron-LM) (you can also use it independently).
It is best at:
- loading large training data to train large multi-modal models
- blending many different datasets together
- distributing the work across many nodes and processes of a cluster
- ensuring reproducibility and resumability
- adapting easily to various types of data samples and processing

Try using it together with [Megatron](https://github.com/NVIDIA/Megatron-LM) Core.
# Quickstart
**Megatron Energon** is a pip-installable Python package that offers:
- dataset-related classes that you can import in your project
- a command line utility for data preprocessing and conversion

This document is just a quick start. Please also check out the [documentation](https://nvidia.github.io/Megatron-Energon/).
## Installation
To install the latest stable version:
```shell
pip install megatron-energon
```
Or to install the current development version:
```shell
pip install git+https://github.com/NVIDIA/Megatron-Energon.git
```
**NOTE**: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
For more details on installing this package, see the [installation guide](https://nvidia.github.io/Megatron-Energon/intro/installation.html).
## Usage of command line tool
After installation, the command `energon` will be available.
Here are some examples of things you can do:
| Command | Description |
|---|---|
| `energon prepare DATASET_ROOT` | Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset |
| `energon lint DATASET_ROOT` | Verify that the dataset complies with the energon dataset format and that all samples are loadable |
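
If you want to check a dataset from a script before training, you can call the command line tool from Python. The following is a minimal sketch, not part of the package: the helper names `energon_lint_command` and `lint_dataset` are hypothetical, and treating a zero exit status as "lint passed" is an assumption you should verify against the documentation.

```python
import shutil
import subprocess
from typing import List


def energon_lint_command(dataset_root: str) -> List[str]:
    """Return the argument list for `energon lint DATASET_ROOT`."""
    return ["energon", "lint", dataset_root]


def lint_dataset(dataset_root: str) -> bool:
    """Run `energon lint` on a dataset and report whether it exited with 0.

    Assumption: a zero exit status means the dataset passed the check.
    """
    if shutil.which("energon") is None:
        raise RuntimeError(
            "`energon` not found on PATH; is megatron-energon installed?"
        )
    # check=False: inspect the return code instead of raising on failure
    result = subprocess.run(energon_lint_command(dataset_root), check=False)
    return result.returncode == 0
```

For example, `lint_dataset('/my/dataset/path')` could guard the start of a training script.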
## Usage of the library
To get started, pick a [WebDataset](https://github.com/webdataset/webdataset)-compliant dataset and run `energon prepare DATASET_ROOT` on it. This starts the interactive assistant and creates the `.nv-meta` folder. As an alternative to WebDataset, Energon also supports the JSONL format; see the [data preparation documentation](https://nvidia.github.io/Megatron-Energon/basic/data_prep.html).
Once done, try to load it from your Python program:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

# Single-process setup: one rank in total, with two data loader workers
simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

train_ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=simple_worker_config,
)

train_loader = get_loader(train_ds)

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
```
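
The example above hard-codes `rank=0` and `world_size=1`. In a multi-process job you would typically fill these in from your launcher. The sketch below assumes a `torchrun`-style launcher that sets the `RANK` and `WORLD_SIZE` environment variables; the helper names `worker_config_kwargs` and `make_worker_config` are hypothetical and not part of Energon.

```python
import os
from typing import Dict, Mapping, Optional


def worker_config_kwargs(
    num_workers: int = 2,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, int]:
    """Build WorkerConfig keyword arguments from launcher environment variables.

    Assumption: RANK and WORLD_SIZE are set as torchrun sets them.
    Falls back to a single-process setup when they are missing.
    """
    if env is None:
        env = os.environ
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world size {world_size}")
    return {"rank": rank, "world_size": world_size, "num_workers": num_workers}


def make_worker_config(num_workers: int = 2):
    """Create a WorkerConfig for the current process."""
    # Imported lazily so the helper above works without Energon installed
    from megatron.energon import WorkerConfig

    return WorkerConfig(**worker_config_kwargs(num_workers))
```

The result of `make_worker_config()` can then be passed as `worker_config` to `get_train_dataset` in place of `simple_worker_config`.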
For more details, read the [documentation](https://nvidia.github.io/Megatron-Energon/).