Unverified Commit 2749e479, authored by Klaus Hipp, committed by GitHub

[Docs] Fix broken links and syntax issues (#28918)

* Fix model documentation links in attention.md

* Fix external link syntax

* Fix target anchor names of section links

* Fix copyright statement comments

* Fix documentation headings
parent d6286646
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-führen-sie-einen-pre-training-checkpoint-mit-dem-original-repository-durch), you have already created a script that runs a forward
pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
implementation instead of the original one. It should look as follows:
...
@@ -83,7 +83,7 @@ Sie sich nicht auf eine bestimmte Architektur festgelegt haben, ist es eine gute
We will guide you towards the most prominent architectures that are still missing on the TensorFlow
side. If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in
🤗 Transformers but is lacking weights, feel free to jump straight to the
[weight conversion section](#hinzufügen-von-tensorflow-gewichten-zum--hub)
of this page.
For simplicity, the remainder of this guide assumes you've decided to contribute the TensorFlow version of
...
@@ -682,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward
pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
implementation instead of the original one. It should look as follows:
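A minimal sketch of such a script (using the *BrandNewBert* placeholder model from this guide; replace the example input IDs with the ones from your original-repository script):

```python
import torch
from transformers import BrandNewBertModel  # placeholder class name used throughout this guide

# Load the checkpoint that was converted to the 🤗 Transformers format
model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")

# Use the exact same example input as in the original-repository script
input_ids = torch.tensor([[0, 4, 4, 3, 2, 4, 1, 7, 19]])
output = model(input_ids).last_hidden_state

# Compare this output against the one produced by the original implementation,
# e.g. with torch.allclose(original_output, output, atol=1e-3)
```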
...
@@ -83,7 +83,7 @@ don't have your eyes set on a specific architecture, asking the 🤗 Transformer
maximize your impact - we will guide you towards the most prominent architectures that are missing on the TensorFlow
side. If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in
🤗 Transformers but is lacking weights, feel free to jump straight into the
[weight conversion section](#adding-tensorflow-weights-to--hub)
of this page.
For simplicity, the remainder of this guide assumes you've decided to contribute with the TensorFlow version of
...
@@ -22,7 +22,7 @@ use a sparse version of the attention matrix to speed up training.
## LSH attention
[Reformer](model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
modified to mask the current token (except at the first position), because it will give a query and a key equal (so
@@ -31,7 +31,7 @@ very similar to each other). Since the hash can be a bit random, several hash fu
## Local attention
[Longformer](model_doc/longformer) uses local attention: often, the local context (e.g., what are the two tokens to the
left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small
window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
representation of the whole sentence.
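As a toy illustration of the sliding-window idea (not Longformer's actual implementation), a local attention mask simply blocks attention to tokens outside a fixed window:

```python
import torch

# Each token may only attend to tokens at most `window` positions away (toy sketch)
seq_len, window = 10, 2
positions = torch.arange(seq_len)
local_mask = (positions[None, :] - positions[:, None]).abs() <= window  # (seq_len, seq_len) boolean

scores = torch.randn(seq_len, seq_len)                   # stand-in for QK^t / sqrt(d)
scores = scores.masked_fill(~local_mask, float("-inf"))  # disallow positions outside the window
attn = scores.softmax(dim=-1)
```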
@@ -51,7 +51,7 @@ length.
### Axial positional encodings
[Reformer](model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
...
@@ -187,7 +187,7 @@ The model head refers to the last layer of a neural network that accepts the raw
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
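As a quick sketch of what this means in practice (using the standard `gpt2` checkpoint as an assumed example), the task-specific class is just the base model with a small extra layer on top:

```python
from transformers import GPT2ForSequenceClassification

# The head class wraps the base model and adds a classification layer on top of it
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
print(type(model.transformer).__name__)  # GPT2Model (the base model)
print(type(model.score).__name__)        # Linear    (the sequence classification head)
```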
## I
@@ -422,7 +422,7 @@ Models that generate a new sequence from an input, like translation models, or s
### Sharded DDP
Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.
### stride
...
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
...
@@ -29,7 +29,7 @@ alt="drawing" width="600"/>
<small> MGP-STR architecture. Taken from the <a href="https://arxiv.org/abs/2209.03592">original paper</a>. </small>
MGP-STR is trained on two synthetic datasets [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST) without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).
This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR).
## Inference example
...
@@ -26,7 +26,7 @@ The abstract from the paper is the following:
*While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.*
This model was contributed by [zphang](https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus).
## Documentation resources
...
@@ -38,7 +38,7 @@ object detection, instance and semantic segmentation. For example, with a compar
achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope
that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.*
This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT).
- PVTv1 on ImageNet-1K
...
@@ -60,7 +60,7 @@ for summarization: *summarize: ...*.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
- See the [training](#training), [inference](#inference) and [resources](#resources) sections below for all details regarding usage.
T5 comes in different sizes:
...
@@ -51,7 +51,7 @@ The methods and tools covered in this guide can be classified based on the effec
| [Data preloading](#data-preloading) | Yes | No |
| [DeepSpeed Zero](#deepspeed-zero) | No | Yes |
| [torch.compile](#using-torchcompile) | Yes | No |
| [Parameter-Efficient Fine Tuning (PEFT)](#using--peft) | No | Yes |
<Tip>
@@ -62,12 +62,12 @@ large model and a small batch size, the memory use will be larger.
You can combine the above methods to get a cumulative effect. These techniques are available to you whether you are
training your model with [`Trainer`] or writing a pure PyTorch loop, in which case you can [configure these optimizations
with 🤗 Accelerate](#using--accelerate).
If these methods do not result in sufficient gains, you can explore the following options:
* [Look into building your own custom Docker container with efficient software prebuilds](#efficient-software-prebuilds)
* [Consider a model that uses Mixture of Experts (MoE)](#mixture-of-experts)
* [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention-and-flash-attention)
Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving
to a multi-GPU setup. All these approaches are still valid in a multi-GPU setup, plus you can leverage additional parallelism
@@ -110,7 +110,7 @@ training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumu
In the above example, your effective batch size becomes 4.
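In other words (quick arithmetic, assuming a single device for this example):

```python
# Effective batch size = per-device batch size × gradient accumulation steps × number of devices
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1  # assumed single-GPU setup
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 4
```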
Alternatively, use 🤗 Accelerate to gain full control over the training loop. Find the 🤗 Accelerate example
[further down in this guide](#using--accelerate).
While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can
result in a more pronounced training slowdown. Consider the following example. Let's say, the `per_device_train_batch_size=4`
@@ -143,7 +143,7 @@ training_args = TrainingArguments(
)
```
Alternatively, use 🤗 Accelerate - find the 🤗 Accelerate example [further in this guide](#using--accelerate).
<Tip>
@@ -179,7 +179,7 @@ To enable mixed precision training, set the `fp16` flag to `True`:
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```
If you prefer to use 🤗 Accelerate, find the 🤗 Accelerate example [further in this guide](#using--accelerate).
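For orientation, a minimal, self-contained sketch of enabling fp16 with 🤗 Accelerate could look like this (toy model and data, and it assumes a GPU is available; the guide's own, fuller example appears further down):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Enable fp16 mixed precision through Accelerate instead of TrainingArguments
accelerator = Accelerator(mixed_precision="fp16")

model = torch.nn.Linear(16, 2)  # toy model standing in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # use accelerator.backward instead of loss.backward
    optimizer.step()
```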
### BF16
...
@@ -214,7 +214,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="aut
<Tip warning={true}>
Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on an NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub to see if a GPTQ-quantized version of the model already exists.
</Tip>
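One quick way to do that check programmatically is to search the Hub (just an illustrative search string; naming conventions for quantized checkpoints vary):

```python
from huggingface_hub import HfApi

# Look for already-quantized variants before spending time quantizing the model yourself
api = HfApi()
for model in api.list_models(search="opt-350m gptq", limit=10):
    print(model.id)
```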
...
@@ -36,7 +36,7 @@ being a large model means it requires significant computational resources and in
this approach suits your use case better than fine-tuning specialized models for each individual task.
In this guide, you'll learn how to:
- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
- Use IDEFICS for:
  - [Image captioning](#image-captioning)
  - [Prompted image captioning](#prompted-image-captioning)
...
@@ -35,7 +35,7 @@ practices that help to achieve optimal results more consistently.
This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks.
You'll learn:
- [Basics of prompting](#basics-of-prompting)
- [Best practices of LLM prompting](#best-practices-of-llm-prompting)
- [Advanced prompting techniques: few-shot prompting and chain-of-thought](#advanced-prompting-techniques)
- [When to fine-tune instead of prompting](#prompting-vs-fine-tuning)
...
@@ -165,7 +165,7 @@ La cabecera del modelo se refiere a la última capa de una red neuronal que acep
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
## I
@@ -376,7 +376,7 @@ Modelos que generan una nueva secuencia a partir de una entrada, como modelos de
### Sharded DDP
Another name for the foundational [ZeRO](#zero-redundancy-optimizer-zero) concept as used by various other implementations of ZeRO.
### stride
...
@@ -583,7 +583,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
Once the pretrained weights have been correctly loaded into 🤗 Transformers, you need to make sure that the forward pass
is correctly implemented. [Here](#3-4-provare-un-pretrained-checkpoint-usando-la-repo-originale), you have already created and tested
a script that runs the model's forward pass using the original repo. Now you should do the same with an analogous script
that uses the 🤗 Transformers implementation instead of the original one. The script should look roughly like this:
...
@@ -474,8 +474,7 @@ _<model_name>.py`
* Include the model architecture and its corresponding features in [`~onnx.features.FeatureManager`]
* Add your model architecture to the tests in `test_onnx_v2.py`
Check out how the configuration for [IBERT](https://github.com/huggingface/transformers/pull/14868/files) was contributed to
get an idea of what is involved.
## TorchScript
@@ -612,7 +611,7 @@ Usare il modello tracciato per l'inferenza è semplice come usare il suo metodo
traced_model(tokens_tensor, segments_tensors)
```
### Deploying HuggingFace TorchScript models to AWS using the Neuron SDK
AWS introduced the [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)
instance family for low-cost, high-performance machine learning inference in the cloud.
...
@@ -571,7 +571,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder")
**7. Implement the forward pass**
After correctly loading the pretrained weights into the 🤗 Transformers implementation, you need to make sure the forward pass is implemented correctly. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you already created a script that runs the model's forward pass using the original repository. Now you should write a similar script that uses the 🤗 Transformers implementation instead of the original repository. It should look like this:
```python
model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")
...
@@ -77,7 +77,7 @@ TensorFlowモデルアーキテクチャを追加するために必要なステ
If you haven't decided on a specific architecture yet, asking the 🤗 Transformers team for a suggestion is a great way to maximize your impact.
The team will guide you towards the most prominent architectures that are missing on the TensorFlow side.
If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in 🤗 Transformers but is missing its weights,
jump straight to the [weight conversion section](#adding-tensorflow-weights-to--hub) of this page.
For simplicity, the remainder of this guide assumes you've decided to contribute the TensorFlow version of *BrandNewBert*
(the same example as in the [guide for adding a new model](add_new_model)).
...