Unverified Commit 4f1b31c2 authored by Joao Gante, committed by GitHub

Docs: 4 bit doc corrections (#24572)

4 bit doc corrections
parent 1fd52e6e
@@ -67,23 +67,23 @@ You can quickly run an FP4 model on a single GPU by running the following code:
```py
from transformers import AutoModelForCausalLM
model_name = "bigscience/bloom-2b5"
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
+model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
```
Note that `device_map` is optional, but setting `device_map = 'auto'` is preferred for inference as it will efficiently dispatch the model on the available resources.
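As a quick sanity check, you can run a short generation with the quantized model and see how `device_map="auto"` placed it. This is a minimal sketch, not part of the original snippet; it assumes the `model_name` and `model_4bit` variables defined above and an arbitrary prompt:
```py
from transformers import AutoTokenizer

# Hypothetical sanity check: generate a few tokens with the 4-bit model and
# inspect the device placement chosen by `device_map="auto"`.
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model_4bit.device)

outputs = model_4bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# `hf_device_map` is populated by accelerate when `device_map` is passed
print(model_4bit.hf_device_map)
```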
### Running FP4 models - multi GPU setup
-The way to load your mixed 8-bit model in multiple GPUs is as follows (same command as single GPU setup):
+The way to load your mixed 4-bit model in multiple GPUs is as follows (same command as single GPU setup):
```py
model_name = "bigscience/bloom-2b5"
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
+model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
```
But you can control the GPU RAM allocated to each GPU using `accelerate`. Use the `max_memory` argument as follows:
```py
max_memory_mapping = {0: "600MB", 1: "1GB"}
model_name = "bigscience/bloom-3b"
-model_8bit = AutoModelForCausalLM.from_pretrained(
+model_4bit = AutoModelForCausalLM.from_pretrained(
model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping
)
```
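If you want to confirm that the `max_memory` limits were respected, you can print the device map computed by `accelerate` and the memory actually allocated on each GPU. This is a minimal sketch, not part of the original snippet; it assumes the `model_4bit` variable above and two visible CUDA devices:
```py
import torch

# Show which device (GPU index, "cpu" or "disk") each module was assigned to
print(model_4bit.hf_device_map)

# Report the memory actually allocated on each visible GPU
for device_id in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(device_id) / 1024**3
    print(f"GPU {device_id}: {allocated_gb:.2f} GiB allocated")
```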