Unverified Commit de139702 authored by Arthur, committed by GitHub

[`LlamaFamiliy`] add a tip about dtype (#25794)



* add a warning=True tip to the Llama2 doc

* code llama needs a tip too

* doc nit

* build PR doc

* doc nits
Co-authored-by: Lysandre <lysandre@huggingface.co>

---------
Co-authored-by: Lysandre <lysandre@huggingface.co>
parent 686c68f6
@@ -26,6 +26,16 @@ The abstract from the paper is the following:
Check out all CodeLlama models [here](https://huggingface.co/models?search=code_llama)
<Tip warning={true}>
The `Llama2` family of models, on which Code Llama is based, was trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online), then cast to the default `dtype` of `torch` (which is `torch.float32`), and finally, if a `torch_dtype` is provided in the config, it will be used.
Training the model in `float16` is not recommended and is known to produce `nan`; the model should therefore be trained in `bfloat16`.
</Tip>
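To illustrate the casting behaviour described above, here is a minimal sketch (not part of the original tip; the checkpoint name is just an example) contrasting the default loading path with `torch_dtype="auto"`:

```python
import torch
from transformers import AutoModelForCausalLM

# Default path: the checkpoint is loaded in torch's default dtype (torch.float32),
# whatever dtype the weights were saved in.
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
print(model.dtype)  # torch.float32

# With torch_dtype="auto", the `torch_dtype` stored in the checkpoint's config is used,
# so the weights stay in float16 as uploaded on the Hub.
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", torch_dtype="auto")
print(model.dtype)  # torch.float16
```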
Tips:
- These models have the same architecture as the `Llama2` models
@@ -75,8 +85,8 @@ If you only want the infilled part:
>>> from transformers import pipeline
>>> import torch
>>> generator = pipeline("text-generation", model="codellama/CodeLlama-7b-hf", torch_dtype=torch.float16, device_map="auto")
>>> generator('def remove_non_ascii(s: str) -> str:\n """ <FILL_ME>\n return result', max_new_tokens=128, return_type=1)
```
Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even though the biggest versions
come in several checkpoints, each checkpoint contains a part of every weight of the model, so we need to load them all in RAM). For the 75B model, that means 145GB of RAM is needed.
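If CPU RAM is tight, a hedged sketch along the following lines (not from the original page; it assumes `accelerate` is installed) loads the sharded checkpoints directly in half precision and dispatches them to the available devices as they are read:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "codellama/CodeLlama-7b-hf"  # example checkpoint; larger variants work the same way
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # keep the weights in half precision while loading
    device_map="auto",          # requires `accelerate`; shards are placed on devices as they load
)
```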
...
@@ -26,6 +26,17 @@ The abstract from the paper is the following:
Check out all Llama2 models [here](https://huggingface.co/models?search=llama2)
<Tip warning={true}>
The `Llama2` models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online), then cast to the default `dtype` of `torch` (which is `torch.float32`), and finally, if a `torch_dtype` is provided in the config, it will be used.
Training the model in `float16` is not recommended and is known to produce `nan`; the model should therefore be trained in `bfloat16`.
</Tip>
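As a minimal sketch of the training recommendation above (the checkpoint name is a placeholder and the snippet is not part of the original tip), the model can be loaded directly in `bfloat16` before being handed to a trainer:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in bfloat16 for training; float16 training is known to produce `nan`.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)
model.train()
```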
Tips:
- Weights for the Llama2 models can be obtained by filling out [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
...