Unverified Commit 1af58c07 authored by Sylvain Gugger, committed by GitHub

New model sharing tutorial (#5323)

parent efae6645
...
@@ -139,9 +139,8 @@ conversion utilities for the following models:
    task_summary
    model_summary
-   training
    preprocessing
-   serialization
+   training
    model_sharing
    multilingual
...
# Model upload and sharing
Starting with `v2.2.2`, you can now upload and share your fine-tuned models with the community, using the <abbr title="Command-line interface">CLI</abbr> that's built into the library.
**First, create an account on [https://huggingface.co/join](https://huggingface.co/join)**. Optionally, join an existing organization or create a new one. Then:
```shell
transformers-cli login
# log in using the same credentials as on huggingface.co
```
Upload your model:
```shell
transformers-cli upload ./path/to/pretrained_model/
# ^^ Upload folder containing weights/tokenizer/config
# saved via `.save_pretrained()`
transformers-cli upload ./config.json [--filename folder/foobar.json]
# ^^ Upload a single file
# (you can optionally override its filename, which can be nested inside a folder)
```
If you want your model to be namespaced by your organization name rather than your username, add the following flag to any command:
```shell
--organization organization_name
```
Your model will then be accessible through its identifier, a concatenation of your username (or organization name) and the folder name above:
```python
"username/pretrained_model"
# or if an org:
"organization_name/pretrained_model"
```
**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hardware used, hyperparameters), evaluation results, intended uses & limitations, etc.
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("namespace/pretrained_model")
model = AutoModel.from_pretrained("namespace/pretrained_model")
```
List all your files on S3:
```shell
transformers-cli s3 ls
```
You can also delete unneeded files:
```shell
transformers-cli s3 rm pretrained_model/filename
```
Model sharing and uploading
===========================
On this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
the `model hub <https://huggingface.co/models>`__.
.. note::
You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
Optionally, you can join an existing organization or create a new one.
Prepare your model for uploading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have seen in the :doc:`training tutorial <training>` how to fine-tune a model on a given task. You have probably
done something similar on your task, either using the model directly in your own training loop or using the
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` class. Let's see how you can share the result on
the `model hub <https://huggingface.co/models>`__.
Basic steps
^^^^^^^^^^^
..
When #5258 is merged, we can remove the need to create the directory.
First, pick a directory with the name you want your model to have on the model hub (its full name will then be
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
::
mkdir path/to/awesome-name-you-picked
or in Python
::
import os
os.makedirs("path/to/awesome-name-you-picked")
Then you can save your model and tokenizer with:
::
model.save_pretrained("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Or, if you're using the :class:`~transformers.Trainer` API:
::
trainer.save_model("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/awesome-name-you-picked")
Make your model work on all frameworks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
..
TODO Sylvain: make this automatic during the upload
You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
your model in another framework, but it will be slower). Don't worry, it's super easy to do (and in a future version,
it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
worry about the GPU, so the installation is straightforward. Check the
`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__
and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
First, check that your model class exists in the other framework, that is, try to import the same model with ``TF``
added to or removed from the start of its name. For instance, if you trained a
:class:`~transformers.DistilBertForSequenceClassification`, try to type
::
from transformers import TFDistilBertForSequenceClassification
and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
type
::
from transformers import DistilBertForSequenceClassification
This will raise an error if your model class does not exist in the other framework (something that should be pretty
rare since we're aiming for full parity between the two frameworks). If you do get that error, skip the rest of this
step and go to the next section.
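If you prefer a programmatic check, you can wrap the import in a ``try`` block (a minimal sketch, reusing the
DistilBERT example above; the flag name is just for illustration):
::
    try:
        from transformers import TFDistilBertForSequenceClassification
        tf_class_available = True
    except ImportError:
        tf_class_available = False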
Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
model class:
::
tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
tf_model.save_pretrained("path/to/awesome-name-you-picked")
and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
model class:
::
pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
pt_model.save_pretrained("path/to/awesome-name-you-picked")
That's all there is to it!
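If you want to double-check the conversion, you can run the same input through both checkpoints and compare the
outputs. This is a minimal sketch, assuming the DistilBERT example above with both frameworks installed; the
tolerance value is arbitrary:
::
    import numpy as np
    from transformers import (DistilBertForSequenceClassification, DistilBertTokenizer,
                              TFDistilBertForSequenceClassification)

    path = "path/to/awesome-name-you-picked"
    tokenizer = DistilBertTokenizer.from_pretrained(path)
    pt_model = DistilBertForSequenceClassification.from_pretrained(path)
    tf_model = TFDistilBertForSequenceClassification.from_pretrained(path)

    # Run one sentence through both models and compare the logits.
    pt_logits = pt_model(tokenizer.encode("Hello, world!", return_tensors="pt"))[0].detach().numpy()
    tf_logits = tf_model(tokenizer.encode("Hello, world!", return_tensors="tf"))[0].numpy()
    assert np.allclose(pt_logits, tf_logits, atol=1e-4)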
Check the directory before uploading
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure there are no garbage files in the directory you'll upload. It should only have:
- a `config.json` file, which saves the :doc:`configuration <main_classes/configuration>` of your model;
- a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason);
- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason);
- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
- a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>`
save;
- maybe an `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
Other files can safely be deleted; a quick way to spot them is sketched below.
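Here is one way to list any unexpected files before uploading (a minimal sketch; the ``expected`` set simply mirrors
the list above):
::
    import os

    expected = {"config.json", "pytorch_model.bin", "tf_model.h5", "special_tokens_map.json",
                "tokenizer_config.json", "vocab.txt", "added_tokens.json"}
    for name in os.listdir("path/to/awesome-name-you-picked"):
        if name not in expected:
            print("Unexpected file, consider deleting:", name)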
Upload your model with the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now go to a terminal and run the following command. It should be run in the virtual environment where you installed 🤗
Transformers, since the :obj:`transformers-cli` command comes with the library.
::
transformers-cli login
Then log in using the same credentials as on huggingface.co. To upload your model, just type
::
transformers-cli upload path/to/awesome-name-you-picked/
This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
just type:
::
transformers-cli upload path/to/awesome-name-you-picked/that-file
or
::
transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
if you want to change its filename.
This uploads the model to your personal account. If you want your model to be namespaced by your organization name
rather than your username, add the following flag to any command:
::
--organization organization_name
so for instance:
::
transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
Your model will then be accessible through its identifier, which is, as we saw above,
`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
Add a model card
^^^^^^^^^^^^^^^^
To make sure everyone knows what your model can do, and what its limitations, potential biases or ethical
considerations are, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should be
named `awesome-name-you-picked-README.md` and follow `this template <https://github.com/huggingface/model_card>`__.
If your model is fine-tuned from another model coming from the model hub (as all 🤗 Transformers pretrained models are),
don't forget to link to its model card so that people can fully trace how your model was built.
If you have never made a pull request to the 🤗 Transformers repo, look at the
:doc:`contributing guide <contributing>` to see the steps to follow.
Using your model
^^^^^^^^^^^^^^^^
Your model now has a page on huggingface.co/models 🔥
Anyone can load it from code:
::
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
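If your model was trained for a task the library has a pipeline for, it can also be used through
:func:`~transformers.pipeline` (a sketch; the ``sentiment-analysis`` task here assumes your model is a sequence
classification model):
::
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model="namespace/awesome-name-you-picked")
    classifier("This model sharing workflow is great!")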
Additional commands
^^^^^^^^^^^^^^^^^^^
You can list all the files you uploaded on the hub like this:
::
transformers-cli s3 ls
You can also delete unneeded files with
::
transformers-cli s3 rm awesome-name-you-picked/filename
...
@@ -282,7 +282,7 @@ Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or
 `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual
 training loop. 🤗 Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if
 you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed
-precision, etc.). See the training tutorial (coming soon) for more details.
+precision, etc.). See the :doc:`training tutorial <training>` for more details.
 Once your model is fine-tuned, you can save it with its tokenizer the following way:
...
Serialization best-practices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 or Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:
* the model itself which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
* the configuration file of the model which is saved as a JSON file, and
* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
The *default filenames* of these files are as follows:
* the model weights file: ``pytorch_model.bin``\ ,
* the configuration file: ``config.json``\ ,
* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
**If you save a model using these *default filenames*\ , you can then re-load the model and tokenizer using the ``from_pretrained()`` method.**
Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
.. code-block:: python
import os

import torch

from transformers import (WEIGHTS_NAME, CONFIG_NAME, BertForQuestionAnswering,
                          BertTokenizer, OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer)
output_dir = "./models/"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_pretrained(output_dir)
# Step 2: Re-load the saved model and vocabulary
# Example for a Bert model
model = BertForQuestionAnswering.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir) # Add specific options if needed
# Example for a GPT model
model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
Here is another way you can save and reload the model if you want to use specific paths for each type of files:
.. code-block:: python
import torch

from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer,
                          OpenAIGPTConfig, OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer)

output_model_file = "./models/my_own_model_file.bin"
output_config_file = "./models/my_own_config_file.bin"
output_vocab_file = "./models/my_own_vocab_file.bin"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_vocab_file)
# Step 2: Re-load the saved model and vocabulary
# Since we didn't save using the predefined WEIGHTS_NAME and CONFIG_NAME, we cannot load with `from_pretrained`.
# Here is how to do it in this situation:
# Example for a Bert model
config = BertConfig.from_json_file(output_config_file)
model = BertForQuestionAnswering(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)  # `args` is the argument namespace from your training script
# Example for a GPT model
config = OpenAIGPTConfig.from_json_file(output_config_file)
model = OpenAIGPTDoubleHeadsModel(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = OpenAIGPTTokenizer(output_vocab_file)