Unverified Commit 7bd86505 authored by Maria Khalusova's avatar Maria Khalusova Committed by GitHub
Browse files

Example of pad_to_multiple_of for padding and truncation guide & docstring update (#22278)

* added an example of pad_to_multiple_of

* make style

* addressed feedback
parent fb0a38b4
...@@ -50,6 +50,7 @@ The following table summarizes the recommended way to setup padding and truncati ...@@ -50,6 +50,7 @@ The following table summarizes the recommended way to setup padding and truncati
| | | `tokenizer(batch_sentences, padding='longest')` | | | | `tokenizer(batch_sentences, padding='longest')` |
| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | | | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | | | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` |
| | padding to a multiple of a value | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8) |
| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or | | truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or |
| | | `tokenizer(batch_sentences, truncation=STRATEGY)` | | | | `tokenizer(batch_sentences, truncation=STRATEGY)` |
| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or | | | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or |
......
...@@ -1342,8 +1342,9 @@ ENCODE_KWARGS_DOCSTRING = r""" ...@@ -1342,8 +1342,9 @@ ENCODE_KWARGS_DOCSTRING = r"""
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification. which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (`int`, *optional*): pad_to_multiple_of (`int`, *optional*):
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable If set will pad the sequence to a multiple of the provided value. Requires `padding` to be activated.
the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta). This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta).
return_tensors (`str` or [`~utils.TensorType`], *optional*): return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are: If set, will return tensors instead of list of python integers. Acceptable values are:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment