model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
2. Applying Quantization
^^^^^^^^^^^^^^^^^^^^^^^^
The output of the above script should show the ``Linear`` modules in the model definition replaced by quantized ``FP8DynamicLinear`` modules.
Note that the ``lm_head`` Linear module at the end is currently skipped by default.
For FP8 quantization, we can recover accuracy with simple round-to-nearest (RTN) quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:

- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations

.. code-block:: text

    Saving the model to Meta-Llama-3-8B-Instruct-FP8-Dynamic

Your model checkpoint with quantized weights should be available at ``Meta-Llama-3-8B-Instruct-FP8-Dynamic/``.
The quantized weights take up less space on disk than the original BF16 checkpoint.
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
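
The following is a minimal sketch of why no calibration data is needed: both kinds of scale are derived directly from the tensor being quantized. It is not the library's implementation. The helper ``row_absmax_scales`` and the toy matrices are hypothetical, the constant 448 is the largest finite value of the FP8 E4M3 format, and rounding to the FP8 grid is left out.

.. code-block:: python

    FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


    def row_absmax_scales(matrix):
        """Return one scale per row so that row / scale fits in the FP8 range.

        Hypothetical helper for illustration only.
        """
        scales = []
        for row in matrix:
            amax = max(abs(v) for v in row)
            # Guard against an all-zero row, which would give a zero scale.
            scales.append(max(amax, 1e-12) / FP8_E4M3_MAX)
        return scales


    # Static, per-channel weight scales: one per output channel (row),
    # computed once when the checkpoint is quantized.
    weight = [[0.5, -2.0, 1.0], [0.25, 0.125, -0.0625]]
    weight_scales = row_absmax_scales(weight)

    # Dynamic, per-token activation scales: one per token (row),
    # recomputed from the live activations on every forward pass.
    activations = [[4.0, -1.0, 0.0], [-8.0, 2.0, 1.0]]
    activation_scales = row_absmax_scales(activations)

    # After scaling, every value lies within the representable FP8 range.
    scaled_weight = [[v / s for v in row] for row, s in zip(weight, weight_scales)]

The weight scales depend only on the weights, and the activation scales only on the current input, so neither step needs a calibration dataset.
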

.. code-block:: bash

    ls -lh Meta-Llama-3-8B-Instruct-FP8-Dynamic/
    total 8.5G
    -rw-rw-r-- 1 user user 869 Jun 7 14:43 config.json
    -rw-rw-r-- 1 user user 194 Jun 7 14:43 generation_config.json
    -rw-rw-r-- 1 user user 4.7G Jun 7 14:43 model-00001-of-00002.safetensors
    -rw-rw-r-- 1 user user 3.9G Jun 7 14:43 model-00002-of-00002.safetensors
    -rw-rw-r-- 1 user user 43K Jun 7 14:43 model.safetensors.index.json
    -rw-rw-r-- 1 user user 296 Jun 7 14:43 special_tokens_map.json
    -rw-rw-r-- 1 user user 50K Jun 7 14:43 tokenizer_config.json
    -rw-rw-r-- 1 user user 8.7M Jun 7 14:43 tokenizer.json

.. code-block:: python

    from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

Finally, you can load the quantized model checkpoint directly in vLLM.

.. code-block:: python

    from vllm import LLM

    model = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Dynamic/")
    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
    result = model.generate("Hello, my name is")

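
As a rough sanity check on the reported load size, the arithmetic below estimates the weight footprint before and after quantization. The architecture numbers are assumptions taken from the public Llama-3-8B configuration, not from this page. The estimate ignores the small normalization weights and the FP8 scales.

.. code-block:: python

    # Architecture numbers assumed from the public Llama-3-8B config.
    VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 128256, 4096, 14336, 32
    KV_DIM = 1024  # 8 key/value heads x 128 head dim (grouped-query attention)

    # Linear layers in one decoder block: q, o (HIDDEN x HIDDEN),
    # k, v (HIDDEN x KV_DIM), and gate, up, down (HIDDEN x INTERMEDIATE).
    attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM
    mlp_params = 3 * HIDDEN * INTERMEDIATE
    quantized_params = LAYERS * (attn_params + mlp_params)

    # The embedding table and the skipped lm_head stay at 2 bytes per value.
    unquantized_params = 2 * VOCAB * HIDDEN

    GIB = 1024 ** 3
    bf16_gib = 2 * (quantized_params + unquantized_params) / GIB
    fp8_gib = (1 * quantized_params + 2 * unquantized_params) / GIB

    print(f"BF16: {bf16_gib:.2f} GiB, FP8: {fp8_gib:.2f} GiB")

Under these assumptions the FP8 estimate comes out at about 8.46 GiB, which is close to the size reported in the loading log, against roughly 15 GiB for BF16.
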
Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):

.. note::

   Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.

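
One way to run such an evaluation from Python is sketched below, assuming the ``lm_eval`` package is installed with its vLLM backend. The helpers ``build_model_args`` and ``run_gsm8k`` are hypothetical, and the 5-shot setting is an assumption that the text above does not specify.

.. code-block:: python

    def build_model_args(model_path, add_bos_token=True):
        """Build the comma-separated model_args string that lm_eval parses."""
        return f"pretrained={model_path},add_bos_token={add_bos_token}"


    def run_gsm8k(model_path, limit=250, num_fewshot=5):
        """Evaluate a checkpoint on a subset of gsm8k (hypothetical helper)."""
        import lm_eval  # deferred so this sketch imports without the package

        return lm_eval.simple_evaluate(
            model="vllm",
            model_args=build_model_args(model_path),
            tasks=["gsm8k"],
            num_fewshot=num_fewshot,  # assumed value, not taken from this page
            limit=limit,  # evaluate only the first `limit` samples
            batch_size="auto",
        )

Calling ``run_gsm8k("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")`` would then return a results dictionary. Running it requires a GPU and a working vLLM installation.
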
For the best inference performance, you can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.
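
To illustrate what a static per-tensor scale means, here is a minimal sketch that derives a single activation scale from calibration batches ahead of time. It is not AutoFP8's implementation. The helper ``calibrate_static_scale`` and the sample data are hypothetical, and a real calibration pass may use a different statistic than the absolute maximum.

.. code-block:: python

    FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


    def calibrate_static_scale(calibration_batches):
        """Return one scale for the whole tensor, fixed after calibration."""
        amax = 0.0
        for batch in calibration_batches:
            for row in batch:
                for value in row:
                    amax = max(amax, abs(value))
        return amax / FP8_E4M3_MAX


    # Two toy calibration batches of activations (tokens x features).
    batches = [
        [[0.5, -1.5]],
        [[3.0, 0.25], [-4.0, 1.0]],
    ]
    static_scale = calibrate_static_scale(batches)

Because the scale is fixed ahead of time, no per-token maximum has to be computed during inference. The trade-off is that the calibration data must be representative of real inputs.
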