Unverified commit 2e90c3df authored by Sylvain Gugger, committed by GitHub

Doc to dataset (#18037)

* Link to the Datasets doc

* Remove unwanted file
parent be79cd7d
@@ -33,7 +33,7 @@ pip install transformers datasets accelerate nvidia-ml-py3
The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar with the `nvidia-smi` command in the terminal - this library allows us to access the same information in Python directly.
-Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`Dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=dataset#datasets.Dataset) with PyTorch format.
+Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format.
```py
...
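The code block for this step is elided in the hunk above. A minimal sketch of what such dummy-data construction could look like with 🤗 Datasets; the variable names are illustrative, not necessarily those in the file:

```py
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512  # 512 sequences, each 512 tokens long
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),  # token IDs in [100, 30000)
    "labels": np.random.randint(0, 2, (dataset_size,)),                   # binary classification labels
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("torch")  # indexing now returns PyTorch tensors
```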
@@ -244,7 +244,7 @@ For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) data
'sampling_rate': 8000}
```
-1. Use 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to upsample the sampling rate to 16kHz:
+1. Use 🤗 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:
```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
...
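For context, a sketch of how this resampling step typically fits together end to end; the split and variable names here are illustrative rather than taken from the hunk:

```py
from datasets import Audio, load_dataset

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# Upsample the audio column from 8kHz to 16kHz; resampling happens lazily on access.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
print(dataset[0]["audio"]["sampling_rate"])  # 16000
```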
@@ -117,7 +117,7 @@ The preprocessing function needs to:
... return batch
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
```py
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
...
@@ -129,7 +129,7 @@ The preprocessing function needs to:
... return inputs
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
```py
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
...
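The `map` call above only drops the raw `audio` column; the rename mentioned in the prose is a separate step. A hedged sketch, assuming the standard [`~datasets.Dataset.rename_column`] method is used:

```py
# Rename the target column so the model's loss computation finds `label`.
encoded_minds = encoded_minds.rename_column("intent_class", "label")
```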
@@ -95,7 +95,7 @@ Create a preprocessing function that will apply the transforms and return the `p
... return examples
```
-Use 🤗 Dataset's [`with_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?#datasets.Dataset.with_transform) method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:
+Use 🤗 Dataset's [`~datasets.Dataset.with_transform`] method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:
```py
>>> food = food.with_transform(transforms)
...
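To make the on-the-fly behaviour concrete, here is a small self-contained sketch; the transform and columns are illustrative and not the tutorial's actual `transforms`:

```py
from datasets import Dataset

def transforms(examples):
    # Called only when elements are accessed, not when with_transform() is set.
    examples["length"] = [len(t) for t in examples["text"]]
    return examples

ds = Dataset.from_dict({"text": ["short", "a longer example"]})
ds = ds.with_transform(transforms)
print(ds[0]["length"])  # transform runs lazily on access -> 5
```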
@@ -118,7 +118,7 @@ Here is how you can create a preprocessing function to convert the list to a str
... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:
```py
>>> tokenized_eli5 = eli5.map(
@@ -245,7 +245,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
@@ -352,7 +352,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
...
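The `to_tf_dataset` calls in these hunks are truncated. A hedged sketch of a typical invocation covering the pieces the prose mentions; the column names, batch size, and collator variable are illustrative:

```py
tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],  # model inputs and labels
    shuffle=True,                                       # shuffle the training order
    batch_size=16,
    collate_fn=data_collator,                           # e.g. a DataCollatorForLanguageModeling
)
```

The same pattern applies to the other `to_tf_dataset` conversions touched by this commit.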
@@ -79,7 +79,7 @@ The preprocessing function needs to do:
... return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
```py
tokenized_swag = swag.map(preprocess_function, batched=True)
@@ -224,7 +224,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
...
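When targets are passed separately, as the line above describes, `label_cols` carries them. A sketch under the assumption that the tokenized multiple-choice dataset is the `tokenized_swag` created earlier in that guide; the column names and batch size are illustrative:

```py
tf_train_set = tokenized_swag["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],  # model inputs go through `columns`
    label_cols=["labels"],                    # targets are returned separately to Keras
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,                 # the DataCollatorForMultipleChoice created above
)
```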
@@ -126,7 +126,7 @@ Here is how you can create a function to truncate and map the start and end toke
... return inputs
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
```py
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
@@ -199,7 +199,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
...
@@ -66,7 +66,7 @@ Create a preprocessing function to tokenize `text` and truncate sequences to be
... return tokenizer(examples["text"], truncation=True)
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
```py
tokenized_imdb = imdb.map(preprocess_function, batched=True)
@@ -144,7 +144,7 @@ At this point, only three steps remain:
</Tip>
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
...
@@ -85,7 +85,7 @@ The preprocessing function needs to:
... return model_inputs
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
```py
>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)
@@ -160,7 +160,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(
...
@@ -126,7 +126,7 @@ Here is how you can create a function to realign the tokens and labels, and trun
... return tokenized_inputs
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
```py
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
@@ -199,7 +199,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
...
@@ -87,7 +87,7 @@ The preprocessing function needs to:
... return model_inputs
```
-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
@@ -162,7 +162,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...
@@ -169,7 +169,7 @@ The [`DefaultDataCollator`] assembles tensors into a batch for the model to trai
</Tip>
-Next, convert the tokenized datasets to TensorFlow datasets with the [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset) method. Specify your inputs in `columns`, and your label in `label_cols`:
+Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:
```py
>>> tf_train_dataset = small_train_dataset.to_tf_dataset(
...
@@ -1189,7 +1189,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, Pu
prefetch: bool = True,
):
"""
-Wraps a HuggingFace `datasets.Dataset` as a `tf.data.Dataset` with collation and batching. This method is
+Wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset` with collation and batching. This method is
designed to create a "ready-to-use" dataset that can be passed directly to Keras methods like `fit()` without
further modification. The method will drop columns from the dataset if they don't match input names for the
model. If you want to specify the column names to return rather than using the names that match this model, we
@@ -1197,7 +1197,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, Pu
Args:
dataset (`Any`):
-A `datasets.Dataset` to be wrapped as a `tf.data.Dataset`.
+A [`~datasets.Dataset`] to be wrapped as a `tf.data.Dataset`.
batch_size (`int`, defaults to 8):
The size of batches to return.
shuffle (`bool`, defaults to `True`):
...
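The method name is not visible in this hunk; the parameters shown (`batch_size`, `shuffle`, `prefetch`) match the `prepare_tf_dataset` helper on TF models. Assuming that is the method being documented, a usage sketch might look like this (the checkpoint, model class, and dataset name are illustrative):

```py
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# `tokenized_dataset` is assumed to be a datasets.Dataset that was already tokenized.
tf_dataset = model.prepare_tf_dataset(
    tokenized_dataset,
    batch_size=8,         # matches the documented default
    shuffle=True,         # shuffle for training; set False for evaluation
    tokenizer=tokenizer,  # lets the method build a default padding collator
)
model.compile(optimizer="adam")  # TF models can fall back to their internal loss
model.fit(tf_dataset)
```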
@@ -232,7 +232,7 @@ class Trainer:
default to [`default_data_collator`] if no `tokenizer` is provided, an instance of
[`DataCollatorWithPadding`] otherwise.
train_dataset (`torch.utils.data.Dataset` or `torch.utils.data.IterableDataset`, *optional*):
-The dataset to use for training. If it is an `datasets.Dataset`, columns not accepted by the
+The dataset to use for training. If it is a [`~datasets.Dataset`], columns not accepted by the
`model.forward()` method are automatically removed.
Note that if it's a `torch.utils.data.IterableDataset` with some randomization and you are training in a
@@ -241,7 +241,7 @@ class Trainer:
manually set the seed of this `generator` at each epoch) or have a `set_epoch()` method that internally
sets the seed of the RNGs used.
eval_dataset (`torch.utils.data.Dataset`, *optional*):
-The dataset to use for evaluation. If it is an `datasets.Dataset`, columns not accepted by the
+The dataset to use for evaluation. If it is a [`~datasets.Dataset`], columns not accepted by the
`model.forward()` method are automatically removed.
tokenizer ([`PreTrainedTokenizerBase`], *optional*):
The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the
@@ -854,8 +854,8 @@ class Trainer:
Args:
eval_dataset (`torch.utils.data.Dataset`, *optional*):
-If provided, will override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not accepted by
-the `model.forward()` method are automatically removed. It must implement `__len__`.
+If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted
+by the `model.forward()` method are automatically removed. It must implement `__len__`.
"""
if eval_dataset is None and self.eval_dataset is None:
raise ValueError("Trainer: evaluation requires an eval_dataset.")
@@ -904,8 +904,8 @@ class Trainer:
Args:
test_dataset (`torch.utils.data.Dataset`, *optional*):
-The test dataset to use. If it is an `datasets.Dataset`, columns not accepted by the `model.forward()`
-method are automatically removed. It must implement `__len__`.
+The test dataset to use. If it is a [`~datasets.Dataset`], columns not accepted by the
+`model.forward()` method are automatically removed. It must implement `__len__`.
"""
data_collator = self.data_collator
@@ -2605,8 +2605,8 @@ class Trainer:
Args:
eval_dataset (`Dataset`, *optional*):
-Pass a dataset if you wish to override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not
-accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
+Pass a dataset if you wish to override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns
+not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
method.
ignore_keys (`List[str]`, *optional*):
A list of keys in the output of your model (if it is a dictionary) that should be ignored when
...
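The `Trainer` docstrings above repeat one behaviour worth illustrating: when a 🤗 Datasets [`~datasets.Dataset`] is passed, columns that `model.forward()` does not accept are dropped automatically. A minimal sketch of relying on that; the checkpoint, dataset, and argument values are illustrative:

```py
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=dataset,  # the raw `text` column is not a forward() argument, so it is removed
    tokenizer=tokenizer,    # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```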
@@ -45,8 +45,8 @@ class Seq2SeqTrainer(Trainer):
Args:
eval_dataset (`Dataset`, *optional*):
-Pass a dataset if you wish to override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not
-accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
+Pass a dataset if you wish to override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns
+not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
method.
ignore_keys (`List[str]`, *optional*):
A list of keys in the output of your model (if it is a dictionary) that should be ignored when
@@ -93,7 +93,7 @@ class Seq2SeqTrainer(Trainer):
Args:
test_dataset (`Dataset`):
-Dataset to run the predictions on. If it is an `datasets.Dataset`, columns not accepted by the
+Dataset to run the predictions on. If it is a [`~datasets.Dataset`], columns not accepted by the
`model.forward()` method are automatically removed. Has to implement the method `__len__`
ignore_keys (`List[str]`, *optional*):
A list of keys in the output of your model (if it is a dictionary) that should be ignored when
...