create_dataset.md 4.06 KB
Newer Older
Steven Liu's avatar
Steven Liu committed
1
2
3
4
5
6
7
8
9
10
11
# Create a dataset for training

There are many datasets on the [Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) to train a model on, but if you can't find one you're interested in or want to use your own, you can create a dataset with the 🤗 [Datasets](hf.co/docs/datasets) library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation.

This guide will show you two ways to create a dataset to finetune on:

- provide a folder of images to the `--train_data_dir` argument
- upload a dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument

<Tip>

12
💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide.
Steven Liu's avatar
Steven Liu committed
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41

</Tip>

## Provide a dataset as a folder

For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) builder from 🤗 Datasets to automatically build a dataset from the folder. Your directory structure should look like:

```bash
data_dir/xxx.png
data_dir/xxy.png
data_dir/[...]/xxz.png
```

Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training:

```bash
accelerate launch train_unconditional.py \
    --train_data_dir <path-to-train-directory> \
    <other-arguments>
```

## Upload your data to the Hub

<Tip>

💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with 🤗 Datasets](https://huggingface.co/blog/image-search-datasets) post.

</Tip>

42
Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images.
Steven Liu's avatar
Steven Liu committed
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80

You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`:

```python
from datasets import load_dataset

# example 1: local folder
dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")

# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset("imagefolder", data_files="path_to_zip_file")

# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset(
    "imagefolder",
    data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
)

# example 4: providing several splits
dataset = load_dataset(
    "imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]}
)
```

Then use the [`~datasets.Dataset.push_to_hub`] method to upload the dataset to the Hub:

```python
# assuming you have ran the huggingface-cli login command in a terminal
dataset.push_to_hub("name_of_your_dataset")

# if you want to push to a private repo, simply pass private=True:
dataset.push_to_hub("name_of_your_dataset", private=True)
```

Now the dataset is available for training by passing the dataset name to the `--dataset_name` argument:

```bash
accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
81
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
Steven Liu's avatar
Steven Liu committed
82
83
84
85
86
87
88
89
  --dataset_name="name_of_your_dataset" \
  <other-arguments>
```

## Next steps

Now that you've created a dataset, you can plug it into the `train_data_dir` (if your dataset is local) or `dataset_name` (if your dataset is on the Hub) arguments of a training script.

90
For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](unconditional_training) or [text-to-image generation](text2image)!