"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "d522afea1324b8156c929f3896df14762c9ea716"
Unverified Commit 1f6885ba authored by Steven Liu, committed by GitHub

add dataset (#20005)

parent 4f1e5e4e
@@ -432,19 +432,30 @@ Depending on your task, you'll typically pass the following parameters to [`Trainer`]:
 >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
 ```
-4. Your preprocessed train and test datasets:
+4. Load a dataset:
 ```py
->>> train_dataset = dataset["train"]  # doctest: +SKIP
->>> eval_dataset = dataset["eval"]  # doctest: +SKIP
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("rotten_tomatoes")
+```
+5. Create a function to tokenize the dataset, and apply it over the entire dataset with [`~datasets.Dataset.map`]:
+```py
+>>> def tokenize_dataset(dataset):
+...     return tokenizer(dataset["text"])
+>>> dataset = dataset.map(tokenize_dataset, batched=True)
 ```
-5. A [`DataCollator`] to create a batch of examples from your dataset:
+6. A [`DataCollatorWithPadding`] to create a batch of examples from your dataset:
 ```py
->>> from transformers import DefaultDataCollator
+>>> from transformers import DataCollatorWithPadding
->>> data_collator = DefaultDataCollator()
+>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
 ```
 Now gather all these classes in [`Trainer`]:
...
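The hunk stops at the line that introduces [`Trainer`], so the code that actually wires these pieces together is outside the visible part of the diff. As a rough sketch only, the loaded dataset, tokenizer, and data collator from the steps above would typically be combined along these lines; the model checkpoint, label count, and `TrainingArguments` values here are placeholder assumptions, not content from this commit:

```py
>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

>>> # Illustrative choices, not taken from the diff above: a sequence-classification
>>> # head on the same checkpoint used for the tokenizer, with two labels.
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
>>> training_args = TrainingArguments(output_dir="test_trainer")

>>> # Pass the tokenized splits and the collator so batches are padded dynamically.
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=dataset["train"],
...     eval_dataset=dataset["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )  # doctest: +SKIP
>>> trainer.train()  # doctest: +SKIP
```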