Unverified Commit 769948fa authored by Stas Bekman, committed by GitHub

json to jsonlines, and doc, and typo (#10043)

parent 8ea412a8
@@ -33,7 +33,9 @@ This directory is in a bit of messy state and is undergoing some cleaning, pleas
## New script
-The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (json or csv), then fine-tune one of the architectures above on it.
+The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
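As a quick illustration of the jsonlines layout (a sketch, not taken from this repo: the `text` and `summary` field names below are hypothetical placeholders for whatever column names your own dataset uses), each line of the file is one self-contained JSON object rather than an element of one big JSON array:

```python
import json

# Two hypothetical training records; "text" and "summary" stand in for
# whatever column names your own dataset uses.
records = [
    {"text": "The quick brown fox jumps over the lazy dog.",
     "summary": "A fox jumps over a dog."},
    {"text": "Transformers provides thousands of pretrained models.",
     "summary": "Transformers offers many pretrained models."},
]

# jsonlines: one JSON object per line, with no surrounding [ ... ] array.
with open("train.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line parses independently.
with open("train.json") as f:
    loaded = [json.loads(line) for line in f]

assert loaded == records
```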
Here is an example on a summarization task:
```bash
@@ -50,15 +52,15 @@ python examples/seq2seq/run_seq2seq.py \
--predict_with_generate
```
-And here is how you would use it on your own files (replace `path_to_csv_or_json_file`, `text_column_name` and `summary_column_name` by the relevant values):
+And here is how you would use it on your own files (replace `path_to_csv_or_jsonlines_file`, `text_column_name` and `summary_column_name` by the relevant values):
```bash
python examples/seq2seq/run_seq2seq.py \
--model_name_or_path t5-small \
+--model_name_or_path t5-small \
--do_train \
--do_eval \
--task summarization \
---train_file path_to_csv_or_json_file \
---validation_file path_to_csv_or_json_file \
+--train_file path_to_csv_or_jsonlines_file \
+--validation_file path_to_csv_or_jsonlines_file \
--output_dir ~/tmp/tst-summarization \
--overwrite_output_dir \
--per_device_train_batch_size=4 \
@@ -87,7 +89,7 @@ python examples/seq2seq/run_seq2seq.py \
--predict_with_generate
```
-And here is how you would use it on your own files (replace `path_to_json_file`, by the relevant values):
+And here is how you would use it on your own files (replace `path_to_jsonlines_file` by the relevant value):
```bash
python examples/seq2seq/run_seq2seq.py \
--model_name_or_path sshleifer/student_marian_en_ro_6_1 \
@@ -98,15 +100,15 @@ python examples/seq2seq/run_seq2seq.py \
--dataset_config_name ro-en \
--source_lang en_XX \
--target_lang ro_RO \
---train_file path_to_json_file \
---validation_file path_to_json_file \
+--train_file path_to_jsonlines_file \
+--validation_file path_to_jsonlines_file \
--output_dir ~/tmp/tst-translation \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
```
-Here the files are expected to be JSON files, with each input being a dictionary with a key `"translation"` containing one key per language (here `"en"` and `"ro"`).
+Here the files are expected to be JSONLINES files, with each input being a dictionary with a key `"translation"` containing one key per language (here `"en"` and `"ro"`).
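Following the shape just described, here is a minimal sketch of producing such a file (the file name and example sentence pairs are made up for illustration):

```python
import json

# Each record nests the sentence pair under a "translation" key,
# with one entry per language ("en" and "ro" here).
pairs = [
    {"translation": {"en": "Hello.", "ro": "Salut."}},
    {"translation": {"en": "Thank you.", "ro": "Multumesc."}},
]

# One JSON object per line, as the script expects for jsonlines input.
with open("train.json", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")

with open("train.json") as f:
    loaded = [json.loads(line) for line in f]

assert loaded[1]["translation"]["ro"] == "Multumesc."
```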
## Old script
@@ -417,4 +419,3 @@ uses 12,723 batches of length 48 and takes slightly more time 9.5 minutes.
The feature is still experimental, because:
+ we can make it much more robust if we have memory mapped/preprocessed datasets.
+ The speedup over sortish sampler is not that large at the moment.