<!--- Copyright (c) 2025, NVIDIA CORPORATION.
SPDX-License-Identifier: BSD-3-Clause -->

# Remote Dataset

Megatron Energon supports remote datasets. Since version 5.2.0, Energon file access is based on [Multi Storage Client (MSC)](https://github.com/NVIDIA/multi-storage-client).
This means you can train or validate on your data directly from any supported storage by simply swapping the dataset path for a so-called _MSC URL_.

## Prerequisites

To use a remote dataset, install Energon with one or more of the following extras:
* `s3`
* `aistore`
* `azure-blob-storage`
* `google-cloud-storage`
* `oci`

like this:
```sh
pip install megatron-energon[s3,oci]
```

Set up the MSC config as described in the [Multi Storage Client documentation](https://nvidia.github.io/multi-storage-client/).
You can also keep using an rclone config with MSC, as described for versions prior to 5.2.0.
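
For orientation, an MSC config defines named profiles that map to storage backends. The sketch below is a hypothetical minimal profile for an S3 bucket; the profile name (`coolstore`), bucket name, and options are placeholders, so check the MSC documentation for the exact schema and the options your provider needs:

```yaml
# Hypothetical minimal MSC config; profile name, bucket, and options
# are placeholders, not values required by Energon.
profiles:
  coolstore:
    storage_provider:
      type: s3
      options:
        base_path: mainbucket   # bucket against which MSC URL paths are resolved
```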

For fast data loading, we recommend activating MSC local caching:

```yaml
cache:
  size: 500G
  use_etag: true
  eviction_policy:
    policy: "fifo"
    refresh_interval: 3600
  cache_backend:
    cache_path: /tmp/msc_cache # prefer to use local NVMe, but Lustre path also works
```

Then point MSC to the config file:

```sh
export MSC_CONFIG=/path/to/msc_config.yaml
```


## The URL syntax

The syntax is as simple as

```
msc://CONFIG_NAME/PATH
```

For example:

```
msc://coolstore/mainbucket/datasets/somedata
```
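
To make the anatomy of such a URL concrete, the sketch below picks the example apart with Python's standard `urllib.parse`. This only illustrates the URL structure; it is not how Energon or MSC parse URLs internally:

```python
from urllib.parse import urlparse

# An MSC URL: the "msc" scheme, then the CONFIG_NAME (profile),
# then the PATH inside that profile's storage.
url = "msc://coolstore/mainbucket/datasets/somedata"
parts = urlparse(url)

print(parts.scheme)  # "msc"       -> marks this as an MSC URL
print(parts.netloc)  # "coolstore" -> the CONFIG_NAME / profile
print(parts.path)    # "/mainbucket/datasets/somedata" -> the PATH
```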

You can use this URL instead of paths to datasets in

* Functions like `get_train_dataset`, `get_val_dataset`
* Inside [metadataset](../basic/metadataset) specifications
* As arguments to `energon prepare` or `energon lint`. Note that those may be slow for remote locations.
* Or as a path to [`energon mount`](energon-mount) to locally inspect your remote dataset 😎

Example usage:

```python
ds = get_train_dataset(
    'msc://coolstore/mainbucket/datasets/somedata',
    batch_size=1,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)
```