Unverified Commit 2749e479, authored by Klaus Hipp, committed by GitHub

[Docs] Fix broken links and syntax issues (#28918)

* Fix model documentation links in attention.md

* Fix external link syntax

* Fix target anchor names of section links

* Fix copyright statement comments

* Fix documentation headings
parent d6286646
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-führen-sie-einen-pre-training-checkpoint-mit-dem-original-repository-durch), you have already created a script that runs a forward
pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
implementation instead of the original one. It should look as follows:
...
@@ -83,7 +83,7 @@ Sie sich nicht auf eine bestimmte Architektur festgelegt haben, ist es eine gute
We will guide you towards the most prominent architectures that are still missing on the TensorFlow
side. If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in
🤗 Transformers but is lacking weights, feel free to jump straight to the
[weight conversion section](#hinzufügen-von-tensorflow-gewichten-zum--hub)
of this page.
For simplicity, the remainder of this guide assumes you've decided to contribute the TensorFlow version of
...
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
implementation instead of the original one. It should look as follows:
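A minimal sketch of such a script (using the *BrandNewBert* placeholder model from this guide; replace the example input IDs with the ones from your original-repository script):

```python
import torch
from transformers import BrandNewBertModel  # placeholder class name used throughout this guide

# Load the checkpoint that was converted to the 🤗 Transformers format
model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")

# Use the exact same example input as in the original-repository script
input_ids = torch.tensor([[0, 4, 4, 3, 2, 4, 1, 7, 19]])
output = model(input_ids).last_hidden_state

# Compare this output against the one produced by the original implementation,
# e.g. with torch.allclose(original_output, output, atol=1e-3)
```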
...
@@ -83,7 +83,7 @@ don't have your eyes set on a specific architecture, asking the 🤗 Transformer
maximize your impact - we will guide you towards the most prominent architectures that are missing on the TensorFlow
side. If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in
🤗 Transformers but is lacking weights, feel free to jump straight into the
[weight conversion section](#adding-tensorflow-weights-to--hub)
of this page.
For simplicity, the remainder of this guide assumes you've decided to contribute with the TensorFlow version of
...
@@ -22,7 +22,7 @@ use a sparse version of the attention matrix to speed up training.
## LSH attention
[Reformer](model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key equal (so
@@ -31,7 +31,7 @@ very similar to each other). Since the hash can be a bit random, several hash fu
## Local attention
[Longformer](model_doc/longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence.
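As a toy illustration of the sliding-window idea (not Longformer's actual implementation), a local attention mask simply blocks attention to tokens outside a fixed window:

```python
import torch

# Each token may only attend to tokens at most `window` positions away (toy sketch)
seq_len, window = 10, 2
positions = torch.arange(seq_len)
local_mask = (positions[None, :] - positions[:, None]).abs() <= window  # (seq_len, seq_len) boolean

scores = torch.randn(seq_len, seq_len)                   # stand-in for QK^t / sqrt(d)
scores = scores.masked_fill(~local_mask, float("-inf"))  # disallow positions outside the window
attn = scores.softmax(dim=-1)
```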
@@ -51,7 +51,7 @@ length.
### Axial positional encodings
[Reformer](model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
...
@@ -187,7 +187,7 @@ The model head refers to the last layer of a neural network that accepts the raw
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
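As a quick sketch of what this means in practice (using the standard `gpt2` checkpoint as an assumed example), the task-specific class is just the base model with a small extra layer on top:

```python
from transformers import GPT2ForSequenceClassification

# The head class wraps the base model and adds a classification layer on top of it
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
print(type(model.transformer).__name__)  # GPT2Model (the base model)
print(type(model.score).__name__)        # Linear    (the sequence classification head)
```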
## I
@@ -422,7 +422,7 @@ Models that generate a new sequence from an input, like translation models, or s
### Sharded DDP
Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.
### stride
...
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
...
@@ -29,7 +29,7 @@ alt="drawing" width="600"/>
<small> MGP-STR architecture. Taken from the <a href="https://arxiv.org/abs/2209.03592">original paper</a>. </small>
MGP-STR is trained on two synthetic datasets [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST) without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR).
## Inference example
...
@@ -26,7 +26,7 @@ The abstract from the paper is the following:
*While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.*
This model was contributed by [zphang](https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus).
## Documentation resources
...
@@ -38,7 +38,7 @@ object detection, instance and semantic segmentation. For example, with a compar
achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope
that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.*
This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT).
- PVTv1 on ImageNet-1K
...
@@ -60,7 +60,7 @@ for summarization: *summarize: ...*.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
- See the [training](#training), [inference](#inference) and [resources](#resources) sections below for all details regarding usage.
T5 comes in different sizes:
...
@@ -51,7 +51,7 @@ The methods and tools covered in this guide can be classified based on the effec
| [Data preloading](#data-preloading) | Yes | No |
| [DeepSpeed Zero](#deepspeed-zero) | No | Yes |
| [torch.compile](#using-torchcompile) | Yes | No |
| [Parameter-Efficient Fine Tuning (PEFT)](#using--peft) | No | Yes |
<Tip>
@@ -62,12 +62,12 @@ large model and a small batch size, the memory use will be larger.
You can combine the above methods to get a cumulative effect. These techniques are available to you whether you are
training your model with [`Trainer`] or writing a pure PyTorch loop, in which case you can [configure these optimizations
with 🤗 Accelerate](#using--accelerate).
If these methods do not result in sufficient gains, you can explore the following options:
* [Look into building your own custom Docker container with efficient software prebuilds](#efficient-software-prebuilds)
* [Consider a model that uses Mixture of Experts (MoE)](#mixture-of-experts)
* [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention-and-flash-attention)
Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving
to a multi-GPU setup. All these approaches are still valid in a multi-GPU setup, plus you can leverage additional parallelism
@@ -110,7 +110,7 @@ training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumu
In the above example, your effective batch size becomes 4.
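In other words (quick arithmetic, assuming a single device for this example):

```python
# Effective batch size = per-device batch size × gradient accumulation steps × number of devices
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1  # assumed single-GPU setup
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 4
```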
Alternatively, use 🤗 Accelerate to gain full control over the training loop. Find the 🤗 Accelerate example
[further down in this guide](#using--accelerate).
While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can
result in a more pronounced training slowdown. Consider the following example. Let's say, the `per_device_train_batch_size=4`
@@ -143,7 +143,7 @@ training_args = TrainingArguments(
)
```
Alternatively, use 🤗 Accelerate - find the 🤗 Accelerate example [further in this guide](#using--accelerate).
<Tip>
@@ -179,7 +179,7 @@ To enable mixed precision training, set the `fp16` flag to `True`:
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```
If you prefer to use 🤗 Accelerate, find the 🤗 Accelerate example [further in this guide](#using--accelerate).
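For orientation, a minimal, self-contained sketch of enabling fp16 with 🤗 Accelerate could look like this (toy model and data, and it assumes a GPU is available; the guide's own, fuller example appears further down):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Enable fp16 mixed precision through Accelerate instead of TrainingArguments
accelerator = Accelerator(mixed_precision="fp16")

model = torch.nn.Linear(16, 2)  # toy model standing in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # use accelerator.backward instead of loss.backward
    optimizer.step()
```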
### BF16
...
@@ -214,7 +214,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="aut
<Tip warning={true}>
Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on an NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub to see if a GPTQ-quantized version of the model already exists.
</Tip>
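One quick way to do that check programmatically is to search the Hub (just an illustrative search string; naming conventions for quantized checkpoints vary):

```python
from huggingface_hub import HfApi

# Look for already-quantized variants before spending time quantizing the model yourself
api = HfApi()
for model in api.list_models(search="opt-350m gptq", limit=10):
    print(model.id)
```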
...
@@ -36,7 +36,7 @@ being a large model means it requires significant computational resources and in
this approach suits your use case better than fine-tuning specialized models for each individual task.
In this guide, you'll learn how to:
- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
- Use IDEFICS for:
  - [Image captioning](#image-captioning)
  - [Prompted image captioning](#prompted-image-captioning)
...
@@ -35,7 +35,7 @@ practices that help to achieve optimal results more consistently.
This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks.
You'll learn:
- [Basics of prompting](#basics-of-prompting)
- [Best practices of LLM prompting](#best-practices-of-llm-prompting)
- [Advanced prompting techniques: few-shot prompting and chain-of-thought](#advanced-prompting-techniques)
- [When to fine-tune instead of prompting](#prompting-vs-fine-tuning)
...
@@ -165,7 +165,7 @@ La cabecera del modelo se refiere a la última capa de una red neuronal que acep
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
## I
@@ -376,7 +376,7 @@ Modelos que generan una nueva secuencia a partir de una entrada, como modelos de
### Sharded DDP
Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.
### stride
...
@@ -583,7 +583,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Once the pretrained weights have been correctly loaded into 🤗 Transformers, you need to make sure that the forward pass
is correctly implemented. [Here](#3-4-provare-un-pretrained-checkpoint-usando-la-repo-originale), you have already created and tested
a script that runs the model's forward pass using the original repo. Now you should do the same with an analogous script
that uses the 🤗 Transformers implementation instead of the original one. The script should look roughly like this:
...
@@ -474,8 +474,7 @@ _<model_name>.py`
* Include the model architecture and its corresponding features in [`~onnx.features.FeatureManager`]
* Add your model architecture to the tests in `test_onnx_v2.py`
Check out how the configuration for [IBERT](https://github.com/huggingface/transformers/pull/14868/files) was contributed to
get an idea of what is involved.
## TorchScript
@@ -612,7 +611,7 @@ Usare il modello tracciato per l'inferenza è semplice come usare il suo metodo
traced_model(tokens_tensor, segments_tensors)
```
### Deploying HuggingFace TorchScript models to AWS using the Neuron SDK
AWS introduced the [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)
instance family for low-cost, high-performance machine learning inference in the cloud.
...
@@ -571,7 +571,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
After correctly loading the pretrained weights into the 🤗 Transformers implementation, you need to make sure the forward pass is implemented correctly. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you already created a script that runs the model's forward pass using the original repository. Now you should write a similar script that uses the 🤗 Transformers implementation instead of the original repository. It should look like this:
```python
model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")
...
@@ -77,7 +77,7 @@ TensorFlowモデルアーキテクチャを追加するために必要なステ
If you haven't decided on a specific architecture yet, asking the 🤗 Transformers team for a suggestion is a great way to maximize your impact.
The team will guide you towards the most prominent architectures that are missing on the TensorFlow side.
If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in 🤗 Transformers but is missing its weights,
jump straight to the [weight conversion section](#adding-tensorflow-weights-to--hub) of this page.
For simplicity, the remainder of this guide assumes you've decided to contribute the TensorFlow version of *BrandNewBert*
(the same example as in the [guide for adding a new model](add_new_model)).
...