@@ -77,16 +77,14 @@ Load a processor with [`AutoProcessor.from_pretrained`]:
...
## AutoModel
<frameworkcontent>
<pt>
Finally, the `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`]:
```py
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
```
Easily reuse the same checkpoint to load an architecture for a different task:
...
@@ -95,10 +93,27 @@ Easily reuse the same checkpoint to load an architecture for a different task:
>>> from transformers import AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
```
Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
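For instance, a minimal sketch of that workflow (reusing the `distilbert-base-uncased` checkpoint from above) might look like this:
```py
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

>>> # tokenize a sample sentence and run it through the model
>>> inputs = tokenizer("Hello world!", return_tensors="pt")
>>> outputs = model(**inputs)
```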
</pt>
<tf>
Finally, the `TFAutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`TFAutoModelForSequenceClassification.from_pretrained`]:
```py
>>> from transformers import TFAutoModelForSequenceClassification
>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
```
Easily reuse the same checkpoint to load an architecture for a different task:
```py
>>> from transformers import TFAutoModelForTokenClassification
>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
```
Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
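For instance, a minimal sketch of the same workflow in TensorFlow (reusing the `distilbert-base-uncased` checkpoint from above) might look like this:
```py
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

>>> # tokenize a sample sentence and run it through the model
>>> inputs = tokenizer("Hello world!", return_tensors="tf")
>>> outputs = model(inputs)
```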
The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an object of type [`PyTorchBenchmarkArguments`] and
[`TensorFlowBenchmarkArguments`], respectively, for instantiation. [`PyTorchBenchmarkArguments`] and [`TensorFlowBenchmarkArguments`] are data classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it is shown how a BERT model of type _bert-base-cased_ can be benchmarked.
<frameworkcontent>
<pt>
```py
>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments
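>>> # A sketch of the rest of the example (argument values assumed): benchmark bert-base-cased
>>> # for one batch size and a few sequence lengths, then run the benchmark.
>>> args = PyTorchBenchmarkArguments(models=["bert-base-cased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
>>> benchmark = PyTorchBenchmark(args)
>>> results = benchmark.run()
```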
@@ -107,6 +107,8 @@ You can also save your configuration file as a dictionary or even just the diffe
...
The next step is to create a [model](main_classes/models). The model - also loosely referred to as the architecture - defines what each layer is doing and what operations are happening. Attributes like `num_hidden_layers` from the configuration are used to define the architecture. Every model shares the base class [`PreTrainedModel`] and a few common methods like resizing input embeddings and pruning self-attention heads. In addition, all models are also either a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) or [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/flax.linen.html#module) subclass. This means models are compatible with each of their respective framework's usage.
<frameworkcontent>
<pt>
Load your custom configuration attributes into the model:
```py
...
@@ -114,7 +116,26 @@ Load your custom configuration attributes into the model:
This creates a model with random values instead of pretrained weights. You won't be able to use this model for anything useful yet until you train it. Training is a costly and time-consuming process. It is generally better to use a pretrained model to obtain better results faster, while using only a fraction of the resources required for training.
Create a pretrained model with [`~PreTrainedModel.from_pretrained`]:
```py
>>> model = DistilBertModel.from_pretrained("distilbert-base-uncased")
```
When you load pretrained weights, the default model configuration is automatically loaded if the model is provided by 🤗 Transformers. However, you can still replace - some or all of - the default model configuration attributes with your own if you'd like:
```py
>>> model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config)
```
</pt>
<tf>
Load your custom configuration attributes into the model:
@@ -123,36 +144,32 @@ Load your custom configuration attributes into the model:
...
This creates a model with random values instead of pretrained weights. You won't be able to use this model for anything useful yet until you train it. Training is a costly and time-consuming process. It is generally better to use a pretrained model to obtain better results faster, while using only a fraction of the resources required for training.
Create a pretrained model with [`~TFPreTrainedModel.from_pretrained`]:
```py
>>> model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
```
When you load pretrained weights, the default model configuration is automatically loaded if the model is provided by 🤗 Transformers. However, you can still replace - some or all of - the default model configuration attributes with your own if you'd like:
```py
>>> model = TFDistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config)
```
</tf>
</frameworkcontent>
At this point, you have a base DistilBERT model which outputs the *hidden states*. The hidden states are passed as inputs to a model head to produce the final output. 🤗 Transformers provides a different model head for each task as long as a model supports the task (i.e., you can't use DistilBERT for a sequence-to-sequence task like translation).
<frameworkcontent>
<pt>
For example, [`DistilBertForSequenceClassification`] is a base DistilBERT model with a sequence classification head. The sequence classification head is a linear layer on top of the pooled outputs.
```py
>>> from transformers import DistilBertForSequenceClassification
>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
```
Easily reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [`DistilBertForQuestionAnswering`] model head. The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output.
...
@@ -161,11 +178,26 @@ Easily reuse this checkpoint for another task by switching to a different model
>>> from transformers import DistilBertForQuestionAnswering
>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```
</pt>
<tf>
For example, [`TFDistilBertForSequenceClassification`] is a base DistilBERT model with a sequence classification head. The sequence classification head is a linear layer on top of the pooled outputs.
```py
>>> from transformers import TFDistilBertForSequenceClassification
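>>> model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
```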
Easily reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [`TFDistilBertForQuestionAnswering`] model head. The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output.
```py
>>> from transformers import TFDistilBertForQuestionAnswering
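>>> model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```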
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
<frameworkcontent>
<pt>
Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below. Be sure to have installed the [torch-scatter](https://github.com/rusty1s/pytorch_scatter) dependency:
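```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering

>>> # for example, the base sized model with a WTQ configuration (checkpoint name assumed)
>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")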
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
```
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing [`TapasConfig`], and then create a [`TapasForQuestionAnswering`] based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example:
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
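>>> # for instance, aggregation plus cell selection heads (argument values assumed)
>>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)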
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
```
</pt>
<tf>
Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below. Be sure to have installed the [tensorflow_probability](https://github.com/tensorflow/probability) dependency:
```py
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # for example, the base sized model with default SQA configuration
...
@@ -99,16 +116,9 @@ dependency in case you're using Tensorflow:
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
```
Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing [`TapasConfig`], and then create a [`TFTapasForQuestionAnswering`] based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example:
```py
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
>>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
...
@@ -116,6 +126,8 @@ Of course, you don't necessarily have to follow one of these three ways in which
>>> # initializing the pre-trained base sized model with our custom classification heads
>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
```
</tf>
</frameworkcontent>
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See [here](https://github.com/google-research/tapas/issues/91#issuecomment-735719340) for more info.
...
@@ -137,9 +149,11 @@ Second, no matter what you picked above, you should prepare your dataset in the
The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion scripts with some automated logic to convert the other datasets (WTQ, WikiSQL) into the SQA format. The author explains this [here](https://github.com/google-research/tapas/issues/50#issuecomment-705465960). A conversion of this script that works with HuggingFace's implementation can be found [here](https://github.com/NielsRogge/tapas_utils). Interestingly, these conversion scripts are not perfect (the `answer_coordinates` and `float_answer` fields are populated based on the `answer_text`), meaning that WTQ and WikiSQL results could actually be improved.
**STEP 3: Convert your data into tensors using TapasTokenizer**
<frameworkcontent>
<pt>
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use [`TapasTokenizer`] to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, [`TapasForQuestionAnswering`] requires different inputs to be fine-tuned:
Note that [`TapasTokenizer`] expects the data of the table to be **text-only**. You can use `.astype(str)` on a dataframe to turn it into text-only data.
...
@@ -249,7 +236,53 @@ Of course, this only shows how to encode a single training example. It is advise
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use [`TapasTokenizer`] to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, [`TFTapasForQuestionAnswering`] requires different inputs to be fine-tuned:
| Strong supervision for aggregation | `input_ids`, `attention_mask`, `token_type_ids`, `labels`, `aggregation_labels` |
[`TapasTokenizer`] creates the `labels`, `numeric_values` and `numeric_values_scale` based on the `answer_coordinates` and `answer_text` columns of the TSV file. The `float_answer` and `aggregation_labels` are already in the TSV file of step 2. Here's an example:
Note that [`TapasTokenizer`] expects the data of the table to be **text-only**. You can use `.astype(str)` on a dataframe to turn it into text-only data.
Of course, this only shows how to encode a single training example. It is advised to create a dataloader to iterate over batches:
```py
>>> import tensorflow as tf
>>> import pandas as pd
...
@@ -302,13 +335,17 @@ Of course, this only shows how to encode a single training example. It is advise
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. If your dataset involves conversational questions (such as in SQA), you should first group together the `queries`, `answer_coordinates` and `answer_text` per table (in the order of their `position`
index) and batch encode each table with its questions. This will make sure that the `prev_labels` token types (see docs of [`TapasTokenizer`]) are set correctly. See [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for more info on the PyTorch model, and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for more info on the TensorFlow model.
<frameworkcontent>
<pt>
You can then fine-tune [`TapasForQuestionAnswering`] as follows (shown here for the weak supervision for aggregation case):
```py
>>> from transformers import TapasConfig, TapasForQuestionAnswering, AdamW
...
@@ -357,7 +394,12 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw
... loss = outputs.loss
... loss.backward()
... optimizer.step()
```
</pt>
<tf>
You can then fine-tune [`TFTapasForQuestionAnswering`] as follows (shown here for the weak supervision for aggregation case):
```py
>>> import tensorflow as tf
>>> from transformers import TapasConfig, TFTapasForQuestionAnswering
...
@@ -402,9 +444,13 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw
Here we explain how you can use [`TapasForQuestionAnswering`] for inference (i.e. making predictions on new data). For inference, only `input_ids`, `attention_mask` and `token_type_ids` (which you can obtain using [`TapasTokenizer`]) have to be provided to the model to obtain the logits. Next, you can use the handy [`~models.tapas.tokenization_tapas.convert_logits_to_predictions`] method to convert these into predicted coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that:
...
@@ -460,7 +506,14 @@ How many movies has George Clooney played in?
Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
```
</pt>
<tf>
Here we explain how you can use [`TFTapasForQuestionAnswering`] for inference (i.e. making predictions on new data). For inference, only `input_ids`, `attention_mask` and `token_type_ids` (which you can obtain using [`TapasTokenizer`]) have to be provided to the model to obtain the logits. Next, you can use the handy [`~models.tapas.tokenization_tapas.convert_logits_to_predictions`] method to convert these into predicted coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that:
```py
>>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
>>> import pandas as pd
...
@@ -512,6 +565,8 @@ Predicted answer: COUNT > 69
What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
```
</tf>
</frameworkcontent>
In the case of a conversational set-up, each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for TensorFlow).
<frameworkcontent>
<pt>
Use the [`AutoModelForSequenceClassification`] and [`AutoTokenizer`] to load the pretrained model and its associated tokenizer (more on an `AutoClass` below):
```py
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
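>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
```
</pt>
<tf>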
Use the [`TFAutoModelForSequenceClassification`] and [`AutoTokenizer`] to load the pretrained model and its associated tokenizer (more on a `TFAutoClass` below):
```py
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
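>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
```
</tf>
</frameworkcontent>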
Then you can specify the model and tokenizer in the [`pipeline`], and apply the `classifier` on your target text:
...
@@ -201,6 +217,8 @@ The tokenizer will return a dictionary containing:
Just like the [`pipeline`], the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:
<frameworkcontent>
<pt>
```py
>>> pt_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
...
@@ -209,7 +227,10 @@ Just like the [`pipeline`], the tokenizer will accept a list of inputs. In addit
... max_length=512,
... return_tensors="pt",
... return_tensors="pt",
... )
```
</pt>
<tf>
```py
>>> tf_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
...
@@ -218,23 +239,22 @@ Just like the [`pipeline`], the tokenizer will accept a list of inputs. In addit
... return_tensors="tf",
... return_tensors="tf",
... )
```
</tf>
</frameworkcontent>
Read the [preprocessing](./preprocessing) tutorial for more details about tokenization.
### AutoModel
<frameworkcontent>
<pt>
🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load an [`AutoModel`] like you would load an [`AutoTokenizer`]. The only difference is selecting the correct [`AutoModel`] for the task. Since you are doing text - or sequence - classification, load [`AutoModelForSequenceClassification`]:
```py
>>> from transformers import AutoModelForSequenceClassification
@@ -243,12 +263,10 @@ See the [task summary](./task_summary) for which [`AutoModel`] class to use for
...
</Tip>
Now you can pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding `**`:
```py
>>> pt_outputs = pt_model(**pt_batch)
```
The model outputs the final activations in the `logits` attribute. Apply the softmax function to the `logits` to retrieve the probabilities:
...
@@ -260,8 +278,33 @@ The model outputs the final activations in the `logits` attribute. Apply the sof
🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load a [`TFAutoModel`] like you would load an [`AutoTokenizer`]. The only difference is selecting the correct [`TFAutoModel`] for the task. Since you are doing text - or sequence - classification, load [`TFAutoModelForSequenceClassification`]:
```py
>>> from transformers import TFAutoModelForSequenceClassification
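>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
```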
One particularly cool 🤗 Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The `from_pt` or `from_tf` parameter can convert the model from one framework to the other:
The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. Then the script fine-tunes a dataset with the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) on an architecture that supports summarization. The following example shows how to fine-tune [T5-small](https://huggingface.co/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset. The T5 model requires an additional `source_prefix` argument due to how it was trained. This prompt lets T5 know this is a summarization task.
The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. Then the script fine-tunes a dataset using Keras on an architecture that supports summarization. The following example shows how to fine-tune [T5-small](https://huggingface.co/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset. The T5 model requires an additional `source_prefix` argument due to how it was trained. This prompt lets T5 know this is a summarization task.
@@ -137,10 +146,10 @@ TensorFlow scripts utilize a [`MirroredStrategy`](https://www.tensorflow.org/gui
...
## Run a script on a TPU
<frameworkcontent>
<pt>
Tensor Processing Units (TPUs) are specifically designed to accelerate performance. PyTorch supports TPUs with the [XLA](https://www.tensorflow.org/xla) deep learning compiler (see [here](https://github.com/pytorch/xla/blob/master/README.md) for more details). To use a TPU, launch the `xla_spawn.py` script and use the `num_cores` argument to set the number of TPU cores you want to use.
</pt>
<tf>
Tensor Processing Units (TPUs) are specifically designed to accelerate performance. TensorFlow scripts utilize a [`TPUStrategy`](https://www.tensorflow.org/guide/distributed_training#tpustrategy) for training on TPUs. To use a TPU, pass the name of the TPU resource to the `tpu` argument.
@@ -157,6 +157,8 @@ Apply the `group_texts` function over the entire dataset:
...
For causal language modeling, use [`DataCollatorForLanguageModeling`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
<frameworkcontent>
<pt>
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:
```py
...
@@ -164,10 +166,6 @@ You can use the end of sequence token as the padding token, and set `mlm=False`.
For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.
...
@@ -177,11 +175,26 @@ For masked language modeling, use the same [`DataCollatorForLanguageModeling`] e
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:
```py
>>> from transformers import DataCollatorForLanguageModeling
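>>> # a sketch of the collator set-up (`return_tensors="tf"` assumed for the TensorFlow examples)
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
```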
For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.
```py
>>> from transformers import DataCollatorForLanguageModeling
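>>> # a sketch (probability value and `return_tensors="tf"` assumed)
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
```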
@@ -134,15 +134,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference
...
Use [`DefaultDataCollator`] to create a batch of examples. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.
Use [`DataCollatorWithPadding`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorWithPadding
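>>> # a sketch, assuming `tokenizer` was loaded earlier in the guide
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```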
@@ -93,15 +93,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference
...
Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForSeq2Seq
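>>> # a sketch, assuming `tokenizer` and the seq2seq `model` were loaded earlier in the guide
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```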
@@ -134,15 +134,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference
...
Use [`DataCollatorForTokenClassification`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForTokenClassification
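>>> # a sketch, assuming `tokenizer` was loaded earlier in the guide
>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```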
@@ -95,15 +95,22 @@ Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference
...
Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForSeq2Seq
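>>> # a sketch, assuming `tokenizer` and the seq2seq `model` were loaded earlier in the guide
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```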