---
title: Dataset Loading
description: Understanding how to load datasets from different sources
back-to-top-navigation: true
toc: true
toc-depth: 5
---

## Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.

## Loading Datasets

We use the `datasets` library to load datasets, with a mix of `load_dataset` and `load_from_disk` under the hood.

You may recognize the similarly named options shared between `load_dataset` and the `datasets` section of the config file.

```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

::: {.callout-tip}

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most common configs to use are `path` and, sometimes, `data_files`.

:::

This matches the API of [`datasets.load_dataset`](https://github.com/huggingface/datasets/blob/0b5998ac62f08e358f8dcc17ec6e2f2a5e9450b6/src/datasets/load.py#L1838-L1858), so if you're familiar with that, you will feel right at home.
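To make that correspondence concrete, here is a minimal, illustrative sketch (not the actual implementation; the helper name is hypothetical) of how one such config entry could be translated into `load_dataset` arguments:

```python
def config_to_load_dataset_kwargs(entry: dict):
    """Translate one `datasets` config entry into a (path, kwargs) pair
    for `datasets.load_dataset`. Illustrative sketch only."""
    path = entry["path"]
    # Keys that map one-to-one onto `load_dataset` keyword arguments
    passthrough = ("name", "data_files", "split", "revision", "trust_remote_code")
    kwargs = {k: entry[k] for k in passthrough if entry.get(k) is not None}
    return path, kwargs

# For example, {"path": "org/dataset-name", "split": "train"} becomes
# load_dataset("org/dataset-name", split="train").
```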

For HuggingFace's guide to loading different dataset types, see [here](https://huggingface.co/docs/datasets/loading).

For full details on the config, see [config.qmd](config.qmd).

::: {.callout-note}

You can load multiple datasets by adding more than one entry under `datasets` in the config file.

```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

:::

### Local dataset

#### Files

To load a JSON file, you would do something like this:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```

Which translates to the following config:

```yaml
datasets:
  - path: data.json
    ds_type: json
```

As the example above shows, you can simply point `path` to the file or directory and set `ds_type` to load the dataset.

This works for CSV, JSON, Parquet, and Arrow files.

::: {.callout-tip}

If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you could omit `ds_type` if you'd like.

:::
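A sketch of what that extension-based inference could look like (the mapping and function name here are illustrative, not the actual implementation):

```python
from pathlib import Path

# Hypothetical extension-to-type table; the exact table in the codebase may differ.
EXTENSION_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_ds_type(path: str):
    """Return the inferred dataset type, or None if the extension is unknown."""
    return EXTENSION_TO_DS_TYPE.get(Path(path).suffix.lower())
```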

#### Directory

If you're loading a directory, you can point the `path` to the directory.

Then, you have two options:

##### Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:
- datasets saved with `datasets.save_to_disk`
- loading entire directory of files (such as with parquet/arrow files)

```yaml
datasets:
  - path: /path/to/your/directory
```
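The fallback order above can be sketched as follows. The two loaders are injected as parameters purely for illustration; in practice they are the `datasets` functions of the same names:

```python
def load_directory(path, load_from_disk, load_dataset):
    """Sketch of the directory loading order: try a saved-to-disk dataset
    first, then fall back to loading the directory's raw files."""
    try:
        # 1. A dataset saved with `datasets.save_to_disk`
        return load_from_disk(path)
    except FileNotFoundError:
        # 2. Fall back to loading the directory of files (parquet/arrow/etc.)
        return load_dataset(path)
```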

##### Loading specific files in directory

Provide `data_files` with a list of files to load.

```yaml
datasets:
    # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

    # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

    # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet

```
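A hypothetical helper showing how `data_files` entries, whether a single filename or a list as in the YAML above, resolve relative to `path` (illustrative, not the actual implementation):

```python
from pathlib import Path

def resolve_data_files(directory, data_files):
    """Accept a single filename or a list of filenames and resolve each
    relative to the dataset directory."""
    if isinstance(data_files, str):
        data_files = [data_files]
    return [str(Path(directory) / name) for name in data_files]
```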

### HuggingFace Hub

How you load the dataset depends on how it was created: whether a folder of files was uploaded directly, or a HuggingFace Dataset was pushed to the Hub.

::: {.callout-note}

If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag in the root-level of the config file.

:::

#### Folder uploaded

This means the dataset consists of one or more files uploaded directly to the Hub.

```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```

#### HuggingFace Dataset

This means the dataset was created as a HuggingFace Dataset and pushed to the Hub via `Dataset.push_to_hub`.

```yaml
datasets:
  - path: org/dataset-name
```

::: {.callout-note}

Depending on the dataset, other configs such as `name`, `split`, `revision`, or `trust_remote_code` may also be required.

:::
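For example, a config entry using some of these options might look like the following (all values are illustrative):

```yaml
datasets:
  - path: org/dataset-name
    name: subset-name    # the dataset configuration/subset to load
    split: train
    revision: main       # a branch, tag, or commit hash
```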

### Remote Filesystems

Via the `storage_options` argument of `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

::: {.callout-warning}

This is currently experimental. Please let us know if you run into any issues!

:::

The only difference between the providers is that you need to prepend the path with the respective provider's protocol.

```yaml
datasets:
    # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

    # Directory
  - path: s3://bucket-name/path/to/your/directory
```

For directories, we load via `load_from_disk`.
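A small sketch of the protocol-based dispatch this implies (the protocol list mirrors the providers below; the helper name is hypothetical):

```python
# Protocols for the remote filesystems covered in the sections below.
REMOTE_PROTOCOLS = ("s3://", "gs://", "gcs://", "adl://", "abfs://", "az://", "oci://")

def is_remote_path(path: str) -> bool:
    """Return True if the path targets one of the supported remote filesystems."""
    return path.startswith(REMOTE_PROTOCOLS)
```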

#### S3

Prepend the path with `s3://`.

The credentials are pulled in the following order:

- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- from the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider

::: {.callout-note}

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may need to add a config option for this.

:::

Other environment variables that can be set are listed in the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables).

#### GCS

Prepend the path with `gs://` or `gcs://`.

The credentials are loaded in the following order:

- gcloud credentials
- for nodes on GCP, the google metadata service
- anonymous access

#### Azure

##### Gen 1

Prepend the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`

##### Gen 2

Prepend the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set are listed in the [adlfs docs](https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials).

#### OCI

Prepend the path with `oci://`.

It attempts to read credentials in the following order:

- `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- when on an OCI resource, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the [ocifs docs](https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables).

### HTTPS

The path should start with `https://`.

```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```

This file must be publicly accessible.

## Next steps

Now that you know how to load datasets, see the [dataset formats docs](dataset-formats) to learn how to map your specific dataset format into your target output format.