<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->

# Quickstart

You can use Megatron Energon to load datasets in the energon dataset format. This format is essentially [WebDataset](https://github.com/webdataset/webdataset) with some metadata added.
Since version 7.2.0, we also support [JSONL](create-jsonl-dataset) as a simpler format with fewer features.

For a moment let's assume you already have prepared a dataset in the needed format, and it's stored on
disk at `/my/dataset/path`. If you want to create a dataset now, check out [](data_prep).

If you simply want to use some dummy data for trying this out, check out the unit test method `create_captioning_test_dataset` inside `tests/test_dataset.py`.

```{admonition} Good to know
:class: tip
You can also store your dataset inside an S3-compatible object store and load it from there! See [](../advanced/remote_dataset)
```

You can then load the dataset like this:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

loader = get_loader(ds)

for batch in loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
```

At first, we call {py:meth}`get_train_dataset <megatron.energon.get_train_dataset>` (click to see signature).
The method will check what kind of dataset is on disk and instantiate the correct class for it.

A worker configuration is always needed to specify how the work is distributed across multiple ranks and workers.
In this simple example, we use a helper method {py:meth}`default_worker_config <megatron.energon.WorkerConfig.default_worker_config>` to get reasonable default values.

The dataset should not be iterated directly, but should be used with a loader, which handles the worker processes.
The batches will contain samples of the sample type specified in the [task encoder](task_encoder).

```{admonition} Good to know
:class: tip
Since we did not specify a task encoder above, the {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>` will be used.
It will not transform the data. For batching it will use common sense magic to pad and stack tensors or build lists if the type is unknown.
```
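
To make the "common sense magic" above a bit more concrete, here is a rough conceptual sketch of what such a default batcher could do for one field across the samples of a batch. This is not the actual energon implementation, just an illustration of the pad/stack/list behavior described above:

```python
import torch


def batch_field(values: list) -> object:
    """Conceptual sketch of default batching for a single sample field."""
    if isinstance(values[0], torch.Tensor):
        if all(v.shape == values[0].shape for v in values):
            # Equal shapes: stack into one tensor with a leading batch dimension
            return torch.stack(values)
        # Unequal shapes: pad each tensor to the per-dimension maximum, then stack
        max_shape = [max(v.shape[d] for v in values) for d in range(values[0].dim())]
        padded = []
        for v in values:
            out = values[0].new_zeros(max_shape)
            out[tuple(slice(0, s) for s in v.shape)] = v
            padded.append(out)
        return torch.stack(padded)
    # Unknown types (e.g. strings): just collect them in a list
    return list(values)
```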

_Wait. Why does the dataset create batches? Shouldn't the dataloader do that?_

Energon will create batches at dataset level.
Internally, most of the cool things that energon can do (such as blending datasets together, [sequence packing](../advanced/packing), etc.)
are dataset wrappers. Even the process of batching is such a wrapper and the default {py:meth}`get_train_dataset <megatron.energon.get_train_dataset>`
function will construct a suitable combination of all these based on the arguments you pass to that function.
Check out the [](basics_flow) section to see the steps in which the data is processed.

_Why must `shuffle_buffer_size` and `max_samples_per_sequence` be set explicitly?_

Because the library is designed to work on (sequential) WebDatasets while still providing proper shuffling, these parameters are required. To make sure the user does not forget to set them, we require that they be set explicitly.
A value of 100 for both settings works well for image datasets (i.e. a good balance between shuffling randomness and seeking overhead), but datasets whose samples are much larger or smaller may require different settings.
Setting the sequence length very small compared to the number of samples in the dataset results in more random access and thus slower data loading, so the recommendation is to set it to a sufficiently high value.
At the same time, a high value reduces the shuffling randomness, which requires a larger shuffle buffer to compensate (i.e. a higher memory footprint and longer state restore times).
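
As a rough illustration of these trade-offs (the numbers below are purely illustrative, not a tuned recommendation), you could annotate the two parameters like this:

```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=1,
    # Larger buffer: better shuffling, but a higher memory footprint and
    # longer state save/restore times.
    shuffle_buffer_size=500,
    # Larger value: more sequential reading (faster I/O), but less randomness,
    # so the shuffle buffer should grow to compensate.
    max_samples_per_sequence=200,
    worker_config=WorkerConfig.default_worker_config(),
)
```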

## Tutorial 1: Preparing the Dataset

For running any of the tutorials, you'll need your data structured as a WebDataset. For more details and more options, check out [](data_prep).

For example, in your folder containing the tar files, run
```sh
$ energon prepare ./

# Example answers to interactive questions:
Ratio: 8,1,1
Dataset class: CaptioningWebdataset
Field map: Yes
  image: jpg
  caption: txt  # if txt contains the caption
# or
  caption: json[caption]  # if .json contains {"caption": "My nice image"}
```
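
After a successful run, energon stores its metadata in a `.nv-meta` subfolder next to your tar files. The exact file set depends on your Energon version, and the shard names below are just placeholders, but the layout typically looks roughly like this:

```text
/my/dataset/path
├── .nv-meta/
│   ├── dataset.yaml   # dataset type and field map
│   ├── split.yaml     # train/val/test split
│   └── .info.yaml     # shard and sample info
├── shard_000.tar
├── shard_001.tar
└── ...
```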


## Tutorial 2: Loading a Dataset

Let's be a bit more concrete and try out the above data loading code with a real dataset.
We are going to print the first batch and stop.

```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

ds = get_train_dataset(
    '/path/to/your/dataset',
    batch_size=None,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

loader = get_loader(ds)

for batch in loader:
    print(batch)
    break
```

This prints

```text
CaptioningSample(__key__='part_00123/00403.tar/004030195', image=tensor([[[0.4549, 0.4784, 0.5059,  ..., 0.8392, 0.8275, 0.7961],
          [0.4549, 0.4784, 0.5020,  ..., 0.6431, 0.6275, 0.5882],
          [0.4510, 0.4706, 0.4941,  ..., 0.6235, 0.6353, 0.6078],
          ...,
          [0.4471, 0.4196, 0.4510,  ..., 0.8471, 0.8039, 0.8275],
          [0.4667, 0.4353, 0.4667,  ..., 0.8196, 0.7804, 0.8078],
          [0.4824, 0.4549, 0.4824,  ..., 0.8196, 0.7843, 0.8118]],

         [[0.3608, 0.3843, 0.4118,  ..., 0.7373, 0.7255, 0.6941],
          [0.3608, 0.3843, 0.4078,  ..., 0.5412, 0.5255, 0.4863],
          [0.3569, 0.3765, 0.4000,  ..., 0.5098, 0.5216, 0.4941],
          ...,
          [0.3608, 0.3333, 0.3647,  ..., 0.7529, 0.7098, 0.7333],
          [0.3804, 0.3490, 0.3804,  ..., 0.7255, 0.6863, 0.7137],
          [0.3961, 0.3686, 0.3961,  ..., 0.7255, 0.6902, 0.7176]],

         [[0.2510, 0.2745, 0.3020,  ..., 0.6000, 0.5882, 0.5569],
          [0.2510, 0.2745, 0.2980,  ..., 0.4039, 0.3882, 0.3490],
          [0.2471, 0.2667, 0.2902,  ..., 0.3765, 0.3882, 0.3608],
          ...,
          [0.2667, 0.2392, 0.2706,  ..., 0.6510, 0.6000, 0.6235],
          [0.2863, 0.2549, 0.2863,  ..., 0.6235, 0.5765, 0.6039],
          [0.3020, 0.2745, 0.3020,  ..., 0.6235, 0.5882, 0.6157]]]), caption='Cello Renting vs. Buying: Which is Right for You?')
```

Awesome, it returns a {py:class}`CaptioningSample <megatron.energon.CaptioningSample>` with the attributes

- `__key__`: `part_00123/00403.tar/004030195`, the identifier of the sample like `TAR_FILE/INDEX`
  - All sample types will have a key. It's in the base class {py:class}`Sample <megatron.energon.Sample>` 
- `image`: The image as a tensor of shape `(1, 3, 267, 400)` (RGB image in a batch of size 1)
- `caption`: A list of strings (here just one since batch size is one) 

Let's also talk about the {py:class}`WorkerConfig <megatron.energon.WorkerConfig>`. As energon is made for distributed training,
you always need to provide a worker config to the dataset to specify how many ranks and workers there are and which rank you're currently on.
For this simple tutorial, we don't really distribute the work, so we use only a single rank with 4 workers. Check out the helper method {py:meth}`default_worker_config <megatron.energon.WorkerConfig.default_worker_config>` to see how the worker config is constructed. Also don't be afraid to click the *`[source]`* link and look at its very short source code.
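
If you want to see what those defaults boil down to, you can also construct the {py:class}`WorkerConfig <megatron.energon.WorkerConfig>` by hand. A minimal sketch for the non-distributed case of this tutorial (a single rank, i.e. rank 0 of a world size of 1, with 4 workers) could look like this:

```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

# Single rank (rank 0 of world size 1) with 4 worker processes
worker_config = WorkerConfig(rank=0, world_size=1, num_workers=4)

loader = get_loader(get_train_dataset(
    '/path/to/your/dataset',
    batch_size=None,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
))
```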

## Tutorial 3: Batch Size

Actually, we would like to use a `batch_size` of more than one, let's go with 2 for now.

```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    '/path/to/your/dataset',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
```

The output will be similar to above but with different shapes and lengths:

- `batch.__key__`: A list of two keys
- `batch.image`: Tensor of shape `(2, 3, 267, 400)`
- `batch.caption`: A list of two caption strings

The default [task encoder](task_encoder) automagically padded and stacked the items to a batch.
This may be ok for some cases, but usually you will want to process and batch your data differently.

Hence, we can

- either use an existing task encoder
- or define a custom one (see [](task_encoder))
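
As a teaser for the latter option, a custom task encoder is typically a subclass of {py:class}`DefaultTaskEncoder <megatron.energon.DefaultTaskEncoder>` that overrides one or more of its hooks. The `encode_sample` hook and the `task_encoder` argument used below are described in [](task_encoder); treat this as a rough sketch and check that page for the full API:

```python
from megatron.energon import (
    CaptioningSample,
    DefaultTaskEncoder,
    WorkerConfig,
    get_loader,
    get_train_dataset,
)


class UpperCaseCaptionTaskEncoder(DefaultTaskEncoder):
    """Toy example: upper-case every caption before batching."""

    def encode_sample(self, sample: CaptioningSample) -> CaptioningSample:
        sample.caption = sample.caption.upper()
        return sample


loader = get_loader(get_train_dataset(
    '/path/to/your/dataset',
    batch_size=2,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    task_encoder=UpperCaseCaptionTaskEncoder(),
))
```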

## Tutorial 4: Blending using Metadataset

A typical use case is to blend multiple datasets of the same (or a similar) type together.
For example, you may want to blend the COCO dataset with the COYO dataset.
The easiest way to do this is to use the metadataset pattern.
For this, you need to create a new `yaml` file that defines the metadataset:

`coyo-coco-dataset.yaml`:
```yaml
__module__: megatron.energon
__class__: MetadatasetV2
splits:
  # Train dataset, the datasets will be blended according to their weights 
  train:
    blend:
      - weight: 5
        path: ./coco
      - weight: 2
        path: ./coyo
  # For val and test, datasets will be concatenated
  val:
    path: ./coco
  test:
    path: ./coyo
```

This assumes that the datasets `coyo` and `coco` exist in subfolders next to the `coyo-coco-dataset.yaml` file. You could also use absolute paths, but that will not work well when using object storage such as S3.

To use it in your loader, simply pass the yaml file to {py:func}`get_train_dataset <megatron.energon.get_train_dataset>`:
```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

loader = get_loader(get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
))

for batch in loader:
    print(batch)
    break
```

If you need to handle samples from different datasets differently in your pipeline, you will want to use `subflavors`.
For these and other details, check out the [](metadataset) section. Energon also supports blending by specifying the number of repetitions for each dataset using [](../advanced/epochized_blending).

## Tutorial 5: Distributed Loading

For multi-GPU support, you may need to adapt the worker config.
So far we have only used the default worker config, which you can get by calling {py:func}`WorkerConfig.default_worker_config() <megatron.energon.WorkerConfig.default_worker_config>`.
This default config tries to infer your multi-GPU setup by using `torch.distributed`, which is fine in most cases.
If you are not using any distributed setup, the default config will work, too. In that case, it assumes a single local rank.

However, if you have a more complex multi-node setup or use other non-data-parallel strategies, you may need to set it up yourself.
The following example shows how it could be set up.

```python
from megatron.energon import get_train_dataset, get_loader, WorkerConfig
import torch.distributed as dist

worker_config = WorkerConfig(
    rank=SET_YOUR_GLOBAL_DATA_RANK_HERE,
    world_size=SET_YOUR_GLOBAL_WORLD_SIZE_HERE,
    num_workers=2,
)

loader = get_loader(get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
))

for batch in loader:
    print(batch)
    break
```
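
If your data-parallel group spans the whole job (i.e. you are not combining data parallelism with model or pipeline parallelism), the placeholders can usually be filled straight from `torch.distributed`; otherwise, pass the rank and size of your data-parallel group instead. A sketch under that assumption:

```python
import torch.distributed as dist

from megatron.energon import WorkerConfig

# Assumes torch.distributed is already initialized and that the data-parallel
# group covers all ranks (no model/pipeline parallelism).
worker_config = WorkerConfig(
    rank=dist.get_rank(),
    world_size=dist.get_world_size(),
    num_workers=2,
)
```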

For a detailed discussion on parallelism, check out [](../advanced/parallelism).

## Tutorial 6: Saving and Restoring

For saving and restoring the state (e.g. when pausing and resuming training), the loader must be created with the savable loader variant.

```python
from megatron.energon import get_train_dataset, get_savable_loader, WorkerConfig

ds = get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)

# Must use the savable loader here. This provides methods to save
# and load the state of the data loader
loader = get_savable_loader(ds)

# Consume a few batches (here: the first 10)
for i, batch in zip(range(10), loader):
    print(batch)

# Save the state
state = loader.save_state_rank()
# Could save the state now using torch.save()

# ... when loading:
# Could load the state with torch.load()

# Restore the state for a new loader
ds = get_train_dataset(
    'coyo-coco-dataset.yaml',
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)
loader = get_savable_loader(ds)
loader.restore_state_rank(state)
```
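
The state returned by `save_state_rank()` is an ordinary picklable Python object, so one straightforward way to persist it is `torch.save`/`torch.load` with one file per data-parallel rank. Continuing the example above (the checkpoint path is just an illustration):

```python
import torch

from megatron.energon import WorkerConfig

worker_config = WorkerConfig.default_worker_config()

# Persist the per-rank dataloader state (path is only an example)
torch.save(state, f"/checkpoints/dataloader_state_rank{worker_config.rank}.pt")

# ... later, when resuming, load the state belonging to this rank and restore it
state = torch.load(f"/checkpoints/dataloader_state_rank{worker_config.rank}.pt")
loader.restore_state_rank(state)
```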

The section [](save_restore) provides code for different saving and loading scenarios, especially in distributed settings.

## More Features

Check out the topics in Advanced Usage for details on specific features.