@@ -19,24 +19,6 @@ FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada L
FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
:::
## Quick Start with Online Dynamic Quantization
Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
```python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
:::{warning}
Currently, we load the model at original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
:::
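To make the per-tensor dynamic scaling described in this section concrete, here is a minimal pure-Python sketch. It is an illustration only, not vLLM's kernel: the helper names `round_to_e4m3`, `dynamic_per_tensor_quantize`, and `dequantize` are hypothetical, and it assumes the scale is derived from the largest absolute value in the tensor and the E4M3 maximum of 448.

```python
import math

# Largest finite magnitude in FP8 E4M3 (assumed "fn" variant: 4 exponent bits, 3 mantissa bits).
FP8_E4M3_MAX = 448.0


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest FP8 E4M3 value, saturating at +/-448."""
    if value == 0.0:
        return 0.0
    magnitude = min(abs(value), FP8_E4M3_MAX)
    # floor(log2(magnitude)), clamped to the smallest normal exponent (-6)
    # so that subnormals share a fixed step of 2**-9.
    exponent = max(math.frexp(magnitude)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    rounded = round(magnitude / step) * step
    return math.copysign(min(rounded, FP8_E4M3_MAX), value)


def dynamic_per_tensor_quantize(values):
    """Quantize a list of floats with a single scale computed from the data itself."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


# Hypothetical activation values; in dynamic mode the scale is recomputed per forward pass.
activations = [0.02, -1.5, 3.0, 7.0]
quantized, scale = dynamic_per_tensor_quantize(activations)
restored = dequantize(quantized, scale)
print(scale, quantized, restored)
```

Because the scale has to be recomputed from the activations on every forward pass, this bookkeeping adds work at inference time, which is consistent with the limited latency improvement noted above.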
## Installation
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
...
@@ -45,12 +27,6 @@ To produce performant FP8 quantized models with vLLM, you'll need to install the
pip install llmcompressor
```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console
pip install vllm lm-eval==0.4.4
```
## Quantization Process
The quantization process involves three main steps:
This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
## Offline Quantization with Static Activation Scaling Factors
You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the `activation_scheme="static"` argument.
tokenizer.pad_token = tokenizer.eos_token
# Load and tokenize 512 dataset samples for calibration of activation scales
Your model checkpoint with quantized weights and activations should be available at `Meta-Llama-3-8B-Instruct-FP8/`.
Finally, you can load the quantized model checkpoint directly in vLLM.
```python
from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
```python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
:::{warning}
Currently, we load the model at original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.