Unverified Commit 31e3c6c3 authored by PD Hall, committed by GitHub

docs: improve clarity for language modeling (#21952)

* docs: improve clarity for clm/mlm

* docs: remove incorrect explanation

* docs: remove incorrect explanation

---------

Co-authored-by: pdhall99 <pdhall99>
parent 0ce5236d
@@ -127,14 +127,14 @@ extract the `text` subfield from its nested structure with the [`flatten`](https

Each subfield is now a separate column as indicated by the `answers` prefix, and the `text` field is a list now. Instead
of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

```py
>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])
```
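For instance, joining turns the list of answer strings for one example into a single string before tokenizing:

```py
>>> # One example's answers are a list of strings; join them into one string first.
>>> " ".join(["first answer", "second answer"])
'first answer second answer'
```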
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need:

```py
>>> tokenized_eli5 = eli5.map(
@@ -145,19 +145,25 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [

...     )
```
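The arguments to `map` fall between the two hunks above, so they are not shown here. A minimal sketch of what the full call could look like, where `num_proc=4` and the `remove_columns` value are illustrative assumptions rather than part of this commit:

```py
>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,  # pass batches of examples to preprocess_function
...     num_proc=4,  # number of worker processes (assumed value)
...     remove_columns=eli5["train"].column_names,  # drop the original columns (assumed)
... )
```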
This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to

- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.
```py
>>> block_size = 128

>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result
```
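For example, with `block_size = 128`, 1,000 concatenated tokens become 7 chunks of 128 tokens each and the trailing 104 tokens are dropped. The chunking function is then applied over the whole tokenized dataset with `map` as well; the sketch below is illustrative (the variable name `lm_dataset` and `num_proc=4` are assumptions), and for causal language modeling the labels are typically just a copy of `input_ids`, which the model shifts internally:

```py
>>> # Chunk every column of the tokenized dataset into block_size pieces.
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
```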
@@ -123,14 +123,14 @@ xtract the `text` subfield from its nested structure with the [`flatten`](https:

Each subfield is now a separate column as indicated by the `answers` prefix, and the `text` field is a list now. Instead
of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

```py
>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])
```
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need:

```py
>>> tokenized_eli5 = eli5.map(
@@ -141,19 +141,25 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [

...     )
```
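Because `truncation` is no longer applied in the first step, it can be useful to see how long the tokenized examples actually are. A purely illustrative check, assuming the split is named `train`:

```py
>>> # Illustrative only: the longest joined answers can exceed a typical 512-token limit.
>>> sorted(len(ids) for ids in tokenized_eli5["train"]["input_ids"])[-3:]
```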
This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to

- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.
```py
>>> block_size = 128

>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result
```
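The tutorial hard-codes `block_size = 128`; if you prefer to tie it to the model instead, one option (a sketch, not part of this commit) is to cap it by the tokenizer's reported maximum length:

```py
>>> # Cap the chunk length by what the model accepts; 128 also keeps GPU memory use modest.
>>> block_size = min(128, tokenizer.model_max_length)
```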