Note that `device_map` is optional but setting `device_map = 'auto'` is prefered for inference as it will dispatch efficiently the model on the available ressources.
### Running FP4 models - multi GPU setup
### Running FP4 models - multi GPU setup
The way to load your mixed 8-bit model in multiple GPUs is as follows (same command as single GPU setup):
The way to load your mixed 4-bit model in multiple GPUs is as follows (same command as single GPU setup):
Note, that you would require a GPU to run mixed-8bit models as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store the quarter (or half if your model weights are in half precision) of the model before using this feature.
Note, that you would require a GPU to run mixed-8bit models as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store the quarter (or half if your model weights are in half precision) of the model before using this feature.
Below are some notes to help you use this module, or follow the demos on [Google colab](#colab-demos).
### Requirements
...
...
@@ -174,7 +174,7 @@ In this example, the first GPU will use 1GB of memory and the second 2GB.
### Colab demos
With this method you can infer on models that were not possible to infer on a Google Colab before.
With this method you can infer on models that were not possible to infer on a Google Colab before.
Check out the demo for running T5-11b (42GB in fp32)! Using 8-bit quantization on Google Colab:
[](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)