Unverified Commit 9cebae64 authored by Phuc Van Phan, committed by GitHub

docs: update link huggingface map (#26077)

parent 7fd2d686
@@ -122,7 +122,7 @@ Así es como puedes crear una función de preprocesamiento para convertir la lis
 ... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
 ```
-Usa de 🤗 Datasets la función [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) para aplicar la función de preprocesamiento sobre el dataset en su totalidad. Puedes acelerar la función `map` configurando el argumento `batched=True` para procesar múltiples elementos del dataset a la vez y aumentar la cantidad de procesos con `num_proc`. Elimina las columnas que no necesitas:
+Usa de 🤗 Datasets la función [`map`](https://huggingface.co/docs/datasets/process#map) para aplicar la función de preprocesamiento sobre el dataset en su totalidad. Puedes acelerar la función `map` configurando el argumento `batched=True` para procesar múltiples elementos del dataset a la vez y aumentar la cantidad de procesos con `num_proc`. Elimina las columnas que no necesitas:
 ```py
 >>> tokenized_eli5 = eli5.map(
...
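The `eli5.map(...)` call above is cut off by the collapsed hunk. As a rough, hypothetical sketch of a call that uses the options the changed paragraph describes (`batched=True`, `num_proc`, and dropping unneeded columns); the specific `num_proc` value and the column list here are illustrative assumptions, not the file's exact contents:

```py
>>> # Hypothetical sketch: batched preprocessing with multiprocessing and column removal.
>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,  # tokenize several examples per call
...     num_proc=4,  # illustrative number of worker processes
...     remove_columns=eli5["train"].column_names,  # drop columns the model does not need
... )
```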
@@ -70,7 +70,7 @@ Crie uma função de pré-processamento para tokenizar o campo `text` e truncar
 ... return tokenizer(examples["text"], truncation=True)
 ```
-Use a função [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) do 🤗 Datasets para aplicar a função de pré-processamento em todo o conjunto de dados. Você pode acelerar a função `map` definindo `batched=True` para processar vários elementos do conjunto de dados de uma só vez:
+Use a função [`map`](https://huggingface.co/docs/datasets/process#map) do 🤗 Datasets para aplicar a função de pré-processamento em todo o conjunto de dados. Você pode acelerar a função `map` definindo `batched=True` para processar vários elementos do conjunto de dados de uma só vez:
 ```py
 tokenized_imdb = imdb.map(preprocess_function, batched=True)
...
@@ -128,7 +128,7 @@ Aqui está como você pode criar uma função para realinhar os tokens e rótulo
 ... return tokenized_inputs
 ```
-Use a função [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) do 🤗 Datasets para tokenizar e alinhar os rótulos em todo o conjunto de dados. Você pode acelerar a função `map` configurando `batched=True` para processar vários elementos do conjunto de dados de uma só vez:
+Use a função [`map`](https://huggingface.co/docs/datasets/process#map) do 🤗 Datasets para tokenizar e alinhar os rótulos em todo o conjunto de dados. Você pode acelerar a função `map` configurando `batched=True` para processar vários elementos do conjunto de dados de uma só vez:
 ```py
 >>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
...
@@ -684,7 +684,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 tokenized_datasets = tokenized_datasets.map(
 group_texts,
 batched=True,
...
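The comment updated in this hunk (and in the identical hunks below) points at the `num_proc` argument of `map`, which is where the multiprocessing happens. A small self-contained sketch of that behaviour; the toy dataset and the `upper` function are assumptions for illustration only, not part of the example scripts:

```py
from datasets import Dataset

# Toy data and function, purely to illustrate the documented map options.
ds = Dataset.from_dict({"text": ["a b c", "d e f", "g h i", "j k l"]})

def upper(batch):
    # batch is a dict of columns -> lists of values because batched=True
    return {"text": [t.upper() for t in batch["text"]]}

# batched=True processes several rows per call; num_proc splits the work
# across worker processes, which is the speed-up the comment refers to.
ds = ds.map(upper, batched=True, num_proc=2)
```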
@@ -607,7 +607,7 @@ def main():
 # to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 lm_datasets = tokenized_datasets.map(
 group_texts,
...
@@ -625,7 +625,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 tokenized_datasets = tokenized_datasets.map(
 group_texts,
 batched=True,
...
@@ -715,7 +715,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 tokenized_datasets = tokenized_datasets.map(
 group_texts,
 batched=True,
...
@@ -533,7 +533,7 @@ def main():
 # to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 with training_args.main_process_first(desc="grouping texts together"):
 if not data_args.streaming:
...
@@ -473,7 +473,7 @@ def main():
 # to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 with accelerator.main_process_first():
 lm_datasets = tokenized_datasets.map(
...
@@ -547,7 +547,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 with training_args.main_process_first(desc="grouping texts together"):
 if not data_args.streaming:
...
@@ -504,7 +504,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 with accelerator.main_process_first():
 tokenized_datasets = tokenized_datasets.map(
...
@@ -478,7 +478,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 with training_args.main_process_first(desc="grouping texts together"):
 tokenized_datasets = tokenized_datasets.map(
...
@@ -395,7 +395,7 @@ def main():
 # to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 lm_datasets = tokenized_datasets.map(
 group_texts,
...
@@ -459,7 +459,7 @@ def main():
 # to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 lm_datasets = tokenized_datasets.map(
 group_texts,
...
@@ -474,7 +474,7 @@ def main():
 # might be slower to preprocess.
 #
 # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
-# https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map
+# https://huggingface.co/docs/datasets/process#map
 tokenized_datasets = tokenized_datasets.map(
 group_texts,
...