Rework documentation for explaining local dataset (#1284)

* rework documentation for explaining local dataset
* fix typo
* Update new_task_guide.md

Unverified Commit b074ccb6 authored Jan 15, 2024 by Lintang Sutawika Committed by GitHub Jan 15, 2024

Showing with 30 additions and 10 deletions

docs/new_task_guide.md +30 -10

--- a/docs/new_task_guide.md
+++ b/docs/new_task_guide.md
@@ -46,16 +46,6 @@ dataset_name: ... # the dataset configuration to use. Leave `null` if your datas
 dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
 ```

------------------------------
-**Tip:** To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
-```
-dataset_path: json
-dataset_name: null
-dataset_kwargs:
-  data_files: /path/to/my/json
-```
-------------------------------
-
 Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:

 ```yaml
@@ -99,6 +89,36 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the
 process_docs: !function utils.process_docs
 ```

+### Using Local Datasets
+
+To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
+
+```
+dataset_path: json
+dataset_name: null
+dataset_kwargs:
+  data_files: /path/to/my/json
+```
+Or with files already split into separate directories:
+
+```
+dataset_path: arrow
+dataset_kwargs:
+  data_files:
+    train: /path/to/arrow/train/data-00000-of-00001.arrow
+    validation: /path/to/arrow/validation/data-00000-of-00001.arrow
+```
+
+Alternatively, if you have previously downloaded a dataset from the Hugging Face Hub (using `save_to_disk()`) and wish to use the local files, set `data_dir` under `dataset_kwargs` to point to the directory that contains them.
+
+```
+dataset_path: hellaswag
+dataset_kwargs:
+  data_dir: hellaswag_local/
+```
+
+You can also set `dataset_path` to a directory path on your local system. This assumes that the directory contains a loading script with the same name as the directory. [See datasets docs](https://huggingface.co/docs/datasets/loading#local-loading-script).
+
 ## Writing a Prompt Template

 The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
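
Note on the new "Using Local Datasets" section: the YAML snippets are easier to reason about once it is clear where each field ends up. The sketch below assumes the harness forwards `dataset_path` as `path`, `dataset_name` as `name`, and every key under `dataset_kwargs` as an extra keyword argument to `datasets.load_dataset`. The helper `build_load_dataset_args` is hypothetical and is not part of the harness; it only illustrates that assumed mapping using the values from the diff.

```python
from typing import Any, Dict


def build_load_dataset_args(task_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: flatten task YAML fields into load_dataset keyword arguments."""
    # `dataset_kwargs` may be absent or explicitly `null` in the YAML.
    extra = task_config.get("dataset_kwargs") or {}
    args: Dict[str, Any] = {
        "path": task_config["dataset_path"],    # e.g. "json", "arrow", or a Hub dataset id
        "name": task_config.get("dataset_name"),  # dataset configuration, may be None
    }
    # Assumption: extra keys such as `data_files` or `data_dir` pass through unchanged.
    args.update(extra)
    return args


# The JSON example from the section, written as the parsed YAML dict.
json_task = {
    "dataset_path": "json",
    "dataset_name": None,
    "dataset_kwargs": {"data_files": "/path/to/my/json"},
}

# The Arrow example, where `data_files` maps split names to files.
arrow_task = {
    "dataset_path": "arrow",
    "dataset_kwargs": {
        "data_files": {
            "train": "/path/to/arrow/train/data-00000-of-00001.arrow",
            "validation": "/path/to/arrow/validation/data-00000-of-00001.arrow",
        }
    },
}

json_args = build_load_dataset_args(json_task)
arrow_args = build_load_dataset_args(arrow_task)
print(json_args)
print(sorted(arrow_args["data_files"]))
```

With the Hugging Face `datasets` library installed and the placeholder paths replaced by real files, the resulting dictionary could be passed on as `datasets.load_dataset(**json_args)`. Under this reading, the split names given under `data_files` are the names the task config would then refer to when declaring its train and validation splits.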