---
title: Dataset Loading
description: Understanding how to load datasets from different sources
back-to-top-navigation: true
toc: true
toc-depth: 5
---

## Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.

## Loading Datasets

We use the `datasets` library to load datasets, with a mix of `load_dataset` and `load_from_disk` under the hood.

You may recognize the similarly named options shared between `load_dataset` and the `datasets` section of the config file.

```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

::: {.callout-tip}

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most common configs to use are `path` and, sometimes, `data_files`.

:::

This matches the API of [`datasets.load_dataset`](https://github.com/huggingface/datasets/blob/0b5998ac62f08e358f8dcc17ec6e2f2a5e9450b6/src/datasets/load.py#L1838-L1858), so if you're familiar with that, you will feel right at home.
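To make that correspondence concrete, here is a minimal, illustrative sketch (not the actual implementation; the helper name is hypothetical) of how one such config entry could be translated into `load_dataset` arguments:

```python
def config_to_load_dataset_kwargs(entry: dict):
    """Translate one `datasets` config entry into a (path, kwargs) pair
    for `datasets.load_dataset`. Illustrative sketch only."""
    path = entry["path"]
    # Keys that map one-to-one onto `load_dataset` keyword arguments
    passthrough = ("name", "data_files", "split", "revision", "trust_remote_code")
    kwargs = {k: entry[k] for k in passthrough if entry.get(k) is not None}
    return path, kwargs

# For example, {"path": "org/dataset-name", "split": "train"} becomes
# load_dataset("org/dataset-name", split="train").
```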

For HuggingFace's guide to loading different dataset types, see [here](https://huggingface.co/docs/datasets/loading).

For full details on the config, see [config.qmd](config.qmd).

::: {.callout-note}

You can load multiple datasets by adding more than one entry under `datasets` in the config file.

```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

:::

### Local dataset

#### Files

To load a JSON file, you would do something like this:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```

Which translates to the following config:

```yaml
datasets:
  - path: data.json
    ds_type: json
```

As the example above shows, you can simply point `path` to the file or directory and set `ds_type` to load the dataset.

This works for CSV, JSON, Parquet, and Arrow files.

::: {.callout-tip}

If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you could omit `ds_type` if you'd like.

:::
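A sketch of what that extension-based inference could look like (the mapping and function name here are illustrative, not the actual implementation):

```python
from pathlib import Path

# Hypothetical extension-to-type table; the exact table in the codebase may differ.
EXTENSION_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_ds_type(path: str):
    """Return the inferred dataset type, or None if the extension is unknown."""
    return EXTENSION_TO_DS_TYPE.get(Path(path).suffix.lower())
```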

#### Directory

If you're loading a directory, you can point the `path` to the directory.

Then, you have two options:

##### Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:
- datasets saved with `datasets.save_to_disk`
- loading entire directory of files (such as with parquet/arrow files)

```yaml
datasets:
  - path: /path/to/your/directory
```
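The fallback order above can be sketched as follows. The two loaders are injected as parameters purely for illustration; in practice they are the `datasets` functions of the same names:

```python
def load_directory(path, load_from_disk, load_dataset):
    """Sketch of the directory loading order: try a saved-to-disk dataset
    first, then fall back to loading the directory's raw files."""
    try:
        # 1. A dataset saved with `datasets.save_to_disk`
        return load_from_disk(path)
    except FileNotFoundError:
        # 2. Fall back to loading the directory of files (parquet/arrow/etc.)
        return load_dataset(path)
```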

##### Loading specific files in directory

Provide `data_files` with a list of files to load.

```yaml
datasets:
    # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

    # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

    # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet

```
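A hypothetical helper showing how `data_files` entries, whether a single filename or a list as in the YAML above, resolve relative to `path` (illustrative, not the actual implementation):

```python
from pathlib import Path

def resolve_data_files(directory, data_files):
    """Accept a single filename or a list of filenames and resolve each
    relative to the dataset directory."""
    if isinstance(data_files, str):
        data_files = [data_files]
    return [str(Path(directory) / name) for name in data_files]
```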

### HuggingFace Hub

How you load the dataset depends on how it was created: whether a folder of files was uploaded directly, or a HuggingFace Dataset was pushed to the Hub.

::: {.callout-note}

If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag in the root-level of the config file.

:::

#### Folder uploaded

This means the dataset consists of one or more files uploaded directly to the Hub.

```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```

#### HuggingFace Dataset

This means the dataset was created as a HuggingFace Dataset and pushed to the Hub via `Dataset.push_to_hub`.

```yaml
datasets:
  - path: org/dataset-name
```

::: {.callout-note}

Depending on the dataset, other configs such as `name`, `split`, `revision`, or `trust_remote_code` may also be required.

:::
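For example, a config entry using some of these options might look like the following (all values are illustrative):

```yaml
datasets:
  - path: org/dataset-name
    name: subset-name    # the dataset configuration/subset to load
    split: train
    revision: main       # a branch, tag, or commit hash
```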

### Remote Filesystems

Via the `storage_options` argument of `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

::: {.callout-warning}

This is currently experimental. Please let us know if you run into any issues!

:::

The only difference between the providers is that you need to prepend the path with the respective provider's protocol.

```yaml
datasets:
    # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

    # Directory
  - path: s3://bucket-name/path/to/your/directory
```

For directories, we load via `load_from_disk`.
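A small sketch of the protocol-based dispatch this implies (the protocol list mirrors the providers below; the helper name is hypothetical):

```python
# Protocols for the remote filesystems covered in the sections below.
REMOTE_PROTOCOLS = ("s3://", "gs://", "gcs://", "adl://", "abfs://", "az://", "oci://")

def is_remote_path(path: str) -> bool:
    """Return True if the path targets one of the supported remote filesystems."""
    return path.startswith(REMOTE_PROTOCOLS)
```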

#### S3

Prepend the path with `s3://`.

The credentials are pulled in the following order:

- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- from the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider

::: {.callout-note}

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may need to add a config option for this.

:::

Other environment variables that can be set are listed in the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables).

#### GCS

Prepend the path with `gs://` or `gcs://`.

The credentials are loaded in the following order:

- gcloud credentials
- for nodes on GCP, the google metadata service
- anonymous access

#### Azure

##### Gen 1

Prepend the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`

##### Gen 2

Prepend the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set are listed in the [adlfs docs](https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials).

#### OCI

Prepend the path with `oci://`.

It attempts to read credentials in the following order:

- `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- when on an OCI resource, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the [ocifs docs](https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables).

### HTTPS

The path should start with `https://`.

```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```

This file must be publicly accessible.

## Next steps

Now that you know how to load datasets, see the [dataset formats docs](dataset-formats) to learn how to map your specific dataset format into your target output format.