"tests/test_tokenization_realm.py" did not exist on "0ffc8eaf53542092271a208a52e881668e753e72"
Unverified commit 769948fa authored by Stas Bekman, committed by GitHub

json to jsonlines, and doc, and typo (#10043)

parent 8ea412a8
@@ -33,7 +33,9 @@ This directory is in a bit of messy state and is undergoing some cleaning, pleas
## New script
The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
For custom datasets in `jsonlines` format, please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
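If it helps to see that layout concretely, here is a minimal sketch of writing such a file with plain Python; the column names `text` and `summary` and the path `train.json` are illustrative assumptions, not fixed by the script:

```python
import json

# Illustrative records for a summarization dataset. The column names
# "text" and "summary" are assumptions for this sketch; use whatever
# column names your command refers to.
records = [
    {"text": "The full article body goes here.", "summary": "A short summary."},
    {"text": "Another article body.", "summary": "Another summary."},
]

# jsonlines format: one JSON object per line, no enclosing list.
with open("train.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```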
Here is an example on a summarization task:
```bash
@@ -50,15 +52,15 @@ python examples/seq2seq/run_seq2seq.py \
    --predict_with_generate
```
And here is how you would use it on your own files (replace `path_to_csv_or_jsonlines_file`, `text_column_name` and `summary_column_name` with the relevant values):
```bash
python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --task summarization \
    --train_file path_to_csv_or_jsonlines_file \
    --validation_file path_to_csv_or_jsonlines_file \
    --output_dir ~/tmp/tst-summarization \
    --overwrite_output_dir \
    --per_device_train_batch_size=4 \
@@ -87,7 +89,7 @@ python examples/seq2seq/run_seq2seq.py \
    --predict_with_generate
```
And here is how you would use it on your own files (replace `path_to_jsonlines_file` with the relevant values):
```bash
python examples/seq2seq/run_seq2seq.py \
    --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
@@ -98,15 +100,15 @@ python examples/seq2seq/run_seq2seq.py \
    --dataset_config_name ro-en \
    --source_lang en_XX \
    --target_lang ro_RO \
    --train_file path_to_jsonlines_file \
    --validation_file path_to_jsonlines_file \
    --output_dir ~/tmp/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate
```
Here the files are expected to be JSONLINES files, with each input being a dictionary with a key `"translation"` containing one key per language (here `"en"` and `"ro"`).
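A minimal sketch of producing a file in that shape (the sentence pairs and the path are made up for illustration):

```python
import json

# Each line holds {"translation": {<lang>: <sentence>, ...}} -- one pair per line.
pairs = [
    {"en": "Good morning.", "ro": "Bună dimineața."},
    {"en": "Thank you very much.", "ro": "Mulțumesc foarte mult."},
]

with open("train.json", "w", encoding="utf-8") as f:
    for pair in pairs:
        # ensure_ascii=False keeps the Romanian diacritics readable in the file.
        f.write(json.dumps({"translation": pair}, ensure_ascii=False) + "\n")
```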
## Old script
@@ -417,4 +419,3 @@ uses 12,723 batches of length 48 and takes slightly more time 9.5 minutes.
The feature is still experimental, because:
+ we can make it much more robust if we have memory mapped/preprocessed datasets.
+ The speedup over sortish sampler is not that large at the moment.