"examples/vscode:/vscode.git/clone" did not exist on "74f6f91a9dc944b1f8872a0d22abd60050aa41bc"
Unverified commit b320d87e authored by Steven Liu, committed by GitHub

Create concept guide section (#16369)

* ✨ create concept guide section

* 🖍 make fixup

* 🖍 apply feedback
Co-authored-by: Steven <stevhliu@gmail.com>
parent ed2ee373
@@ -5,10 +5,6 @@
title: Quick tour
- local: installation
title: Installation
- local: philosophy
title: Philosophy
- local: glossary
title: Glossary
title: Get started
- sections:
- local: pipeline_tutorial
@@ -17,30 +13,20 @@
title: Load pretrained instances with an AutoClass
- local: preprocessing
title: Preprocess
- local: task_summary
title: Summary of the tasks
- local: model_summary
title: Summary of the models
- local: training
title: Fine-tune a pretrained model
- local: accelerate
title: Distributed training with 🤗 Accelerate
- local: model_sharing
title: Share a model
- local: tokenizer_summary
title: Summary of the tokenizers
- local: multilingual
title: Multi-lingual models
title: Tutorials
- sections:
- local: fast_tokenizers
title: "Use tokenizers from 🤗 Tokenizers"
- local: create_a_model
title: Create a custom model
- local: multilingual
title: Inference for multilingual models
- local: troubleshooting
title: Troubleshoot
- local: custom_datasets
title: Fine-tuning with custom datasets
title: Create a custom architecture
- local: custom_models
title: Sharing custom models
- sections:
- local: tasks/sequence_classification
title: Text classification
@@ -65,47 +51,59 @@
title: Fine-tune for downstream tasks
- local: run_scripts
title: Train with a script
- local: notebooks
title: "🤗 Transformers Notebooks"
- local: sagemaker
title: Run training on Amazon SageMaker
- local: community
title: Community
- local: multilingual
title: Inference for multilingual models
- local: converting_tensorflow_models
title: Converting Tensorflow Checkpoints
title: Converting TensorFlow Checkpoints
- local: serialization
title: Export 🤗 Transformers models
- local: performance
title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
- local: parallelism
title: Model Parallelism
- local: benchmarks
title: Benchmarks
- local: migration
title: Migrating from previous packages
- local: troubleshooting
title: Troubleshoot
- local: debugging
title: Debugging
- local: notebooks
title: "🤗 Transformers Notebooks"
- local: community
title: Community
- local: contributing
title: How to contribute to transformers?
- local: add_new_model
title: "How to add a model to 🤗 Transformers?"
- local: add_new_pipeline
title: "How to add a pipeline to 🤗 Transformers?"
- local: fast_tokenizers
title: "Using tokenizers from 🤗 Tokenizers"
- local: performance
title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
- local: parallelism
title: Model Parallelism
- local: testing
title: Testing
- local: debugging
title: Debugging
- local: serialization
title: Exporting 🤗 Transformers models
- local: custom_models
title: Sharing custom models
- local: pr_checks
title: Checks on a Pull Request
title: How-to guides
- sections:
- local: philosophy
title: Philosophy
- local: glossary
title: Glossary
- local: task_summary
title: Summary of the tasks
- local: model_summary
title: Summary of the models
- local: tokenizer_summary
title: Summary of the tokenizers
- local: pad_truncation
title: Padding and truncation
- local: bertology
title: BERTology
- local: perplexity
title: Perplexity of fixed-length models
- local: benchmarks
title: Benchmarks
title: Research
title: Conceptual guides
- sections:
- sections:
- local: main_classes/callback
......
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Create a custom model
# Create a custom architecture
An [`AutoClass`](model_doc/auto) automatically infers the model architecture and downloads pretrained configuration and weights. Generally, we recommend using an `AutoClass` to produce checkpoint-agnostic code. But users who want more control over specific model parameters can create a custom 🤗 Transformers model from just a few base classes. This could be particularly useful for anyone who is interested in studying, training or experimenting with a 🤗 Transformers model. In this guide, dive deeper into creating a custom model without an `AutoClass`. Learn how to:
......
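As a minimal sketch of the pattern this guide builds on (DistilBERT is used here purely as an illustrative architecture), a configuration can be customized and handed directly to the matching model class:

```python
from transformers import DistilBertConfig, DistilBertModel

# Define the architecture explicitly instead of letting an AutoClass infer it
config = DistilBertConfig(n_heads=12, dim=768, hidden_dim=4 * 768)

# Instantiate a randomly initialized model from the custom configuration
model = DistilBertModel(config)
print(model.config.n_heads)  # 12
```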
This diff is collapsed.
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Using tokenizers from 🤗 Tokenizers
# Use tokenizers from 🤗 Tokenizers
The [`PreTrainedTokenizerFast`] depends on the [🤗 Tokenizers](https://huggingface.co/docs/tokenizers) library. The tokenizers obtained from the 🤗 Tokenizers library can be
loaded very simply into 🤗 Transformers.
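A minimal sketch of that hand-off (the `bert-base-uncased` checkpoint is only an illustrative choice, not one this page prescribes):

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load (or train) a tokenizer with the standalone 🤗 Tokenizers library
tokenizer_object = Tokenizer.from_pretrained("bert-base-uncased")

# Wrap it so it behaves like any other 🤗 Transformers fast tokenizer
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer_object)

print(fast_tokenizer("Hello world!")["input_ids"])
```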
......
@@ -35,20 +35,17 @@ Each 🤗 Transformers architecture is defined in a standalone Python module so
The documentation is organized in five parts:
- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
and a glossary.
- **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
transformers model
- **API** contains the documentation of each public class and function, grouped in:
- **GET STARTED** contains a quick tour and installation instructions to get up and running with 🤗 Transformers.
- **TUTORIALS** are a great place to begin if you are new to our library. This section will help you gain the basic skills you need to start using 🤗 Transformers.
- **HOW-TO GUIDES** will show you how to achieve a specific goal like fine-tuning a pretrained model for language modeling or how to create a custom model head.
- **CONCEPTUAL GUIDES** provides more discussion and explanation of the underlying concepts and ideas behind models, tasks, and the design philosophy of 🤗 Transformers.
- **API** describes each class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
The library currently contains Jax, PyTorch and Tensorflow implementations, pretrained model weights, usage scripts and
conversion utilities for the following models.
The library currently contains JAX, PyTorch and TensorFlow implementations, pretrained model weights, usage scripts and conversion utilities for the following models.
### Supported models
......
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Padding and truncation
Batched inputs are often different lengths, so they can't be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a special **padding token** to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.
In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to know are: `padding`, `truncation` and `max_length`.
The `padding` argument controls padding. It can be a boolean or a string:
- `True` or `'longest'`: pad to the longest sequence in the batch (no padding is applied if you only provide
a single sequence).
- `'max_length'`: pad to a length specified by the `max_length` argument or the maximum length accepted
by the model if no `max_length` is provided (`max_length=None`). Padding will still be applied if you only provide a single sequence.
- `False` or `'do_not_pad'`: no padding is applied. This is the default behavior.
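For instance, a small sketch of the two padding strategies (`bert-base-uncased` is just an arbitrary checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch_sentences = ["A short sentence.", "A noticeably longer sentence in the same batch."]

# padding=True (or 'longest') pads to the longest sequence in the batch
longest = tokenizer(batch_sentences, padding=True)

# padding='max_length' pads to the model's maximum input length (512 for BERT)
fixed = tokenizer(batch_sentences, padding="max_length")

print(len(longest["input_ids"][0]), len(fixed["input_ids"][0]))  # (batch-longest length, 512)
```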
The `truncation` argument controls truncation. It can be a boolean or a string:
- `True` or `'longest_first'`: truncate to a maximum length specified by the `max_length` argument or
the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will
truncate token by token, removing a token from the longest sequence in the pair until the proper length is
reached.
- `'only_second'`: truncate to a maximum length specified by the `max_length` argument or the maximum
length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- `'only_first'`: truncate to a maximum length specified by the `max_length` argument or the maximum
length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior.
The `max_length` argument controls the length of the padding and truncation. It can be an integer or `None`, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation or padding to `max_length` is deactivated.
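How `truncation` and `max_length` interact can be sketched with a sequence pair (again with an arbitrary checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 'longest_first' trims tokens from whichever sequence in the pair is currently
# longer, until the combined encoding fits into max_length tokens
pair = tokenizer(
    "The first sequence of this pair is fairly long on purpose.",
    "The second sequence is also long enough to need truncation.",
    truncation="longest_first",
    max_length=16,
)
print(len(pair["input_ids"]))  # 16
```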
The following table summarizes the recommended way to set up padding and truncation. If you use pairs of input sequences in any of the following examples, you can replace `truncation=True` with a `STRATEGY` selected from
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.
| Truncation | Padding | Instruction |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation | no padding | `tokenizer(batch_sentences)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or |
| | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
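As a quick sanity check of two rows from the table (the checkpoint and sentences are illustrative only):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

# "truncation to specific length" + "padding to max sequence in batch"
encoded = tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)

# All sequences share the length of the batch's longest one, capped at 42 tokens
print([len(ids) for ids in encoded["input_ids"]])
```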
@@ -495,64 +495,3 @@ A processor combines a feature extractor and tokenizer. Load a processor with [`
Notice the processor has added `input_values` and `labels`. The sampling rate has also been correctly downsampled to 16kHz.
Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.
\ No newline at end of file
## Everything you always wanted to know about padding and truncation
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The
three arguments you need to know for this are `padding`, `truncation` and `max_length`.
- `padding` controls the padding. It can be a boolean or a string which should be:
- `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
a single sequence).
- `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted
by the model if no `max_length` is provided (`max_length=None`). If you only provide a single sequence,
padding will still be applied to it.
- `False` or `'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
behavior.
- `truncation` controls the truncation. It can be a boolean or a string which should be:
- `True` or `'longest_first'` truncate to a maximum length specified by the `max_length` argument or
the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will
truncate token by token, removing a token from the longest sequence in the pair until the proper length is
reached.
- `'only_second'` truncate to a maximum length specified by the `max_length` argument or the maximum
length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
the second sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
- `'only_first'` truncate to a maximum length specified by the `max_length` argument or the maximum
length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate
the first sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
- `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
default behavior.
- `max_length` to control the length of the padding/truncation. It can be an integer or `None`, in which case
it will default to the maximum length the model can accept. If the model has no specific maximum input length,
truncation/padding to `max_length` is deactivated.
Here is a table summarizing the recommend way to setup padding and truncation. If you use pair of inputs sequence in
any of the following examples, you can replace `truncation=True` by a `STRATEGY` selected in
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation= 'longest_first'` to control how both sequence in the pair are truncated as detailed before.
| Truncation | Padding | Instruction |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation | no padding | `tokenizer(batch_sentences)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or |
| | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` |
| | padding to specific length | Not possible |
| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` |
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Exporting 🤗 Transformers Models
# Export 🤗 Transformers Models
If you need to deploy 🤗 Transformers models in production environments, we
recommend exporting them to a serialized format that can be loaded and executed
......