model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
In the output of the above script, you should see that the ``Linear`` modules in the model definition have been replaced by quantized ``FP8DynamicLinear`` modules.
Note that the ``lm_head`` Linear module at the end is currently skipped by default.

.. code-block:: text

    LlamaForCausalLM(

For FP8 quantization, accuracy can be recovered with simple round-to-nearest (RTN) quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:

- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations

Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
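
The scheme described above can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions, not the code the quantization library runs: the helper names (``round_to_fp8_e4m3``, ``quantize_rows``, ``fp8_dynamic_linear``) are hypothetical, FP8 is assumed to mean the E4M3 format with a maximum magnitude of 448, and values are kept as Python floats rather than packed 8-bit numbers. It shows why no calibration data is needed: weight scales depend only on the weights, and activation scales are computed from each incoming token.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format (assumed format)


def round_to_fp8_e4m3(x):
    """Round-to-nearest onto the FP8 E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def quantize_rows(matrix):
    """Quantize each row with its own scale = max|row| / FP8_E4M3_MAX."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_fp8_e4m3(v / scale) for v in row])
    return quantized, scales


def fp8_dynamic_linear(activations, q_weight, w_scales):
    """Compute y = x @ W^T, quantizing activations per token at call time."""
    q_acts, a_scales = quantize_rows(activations)  # dynamic: depends on this input
    return [
        [
            sum(a * w for a, w in zip(q_row, w_row)) * a_scale * w_scale
            for w_row, w_scale in zip(q_weight, w_scales)
        ]
        for q_row, a_scale in zip(q_acts, a_scales)
    ]


# Weights: rows are output channels, quantized once, offline, without any data.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
q_weight, w_scales = quantize_rows(weight)

# Activations: rows are tokens, quantized on the fly inside the layer.
tokens = [[1.0, 2.0, 3.0], [-0.5, 0.0, 4.0]]
output = fp8_dynamic_linear(tokens, q_weight, w_scales)
reference = [[sum(a * w for a, w in zip(t, r)) for r in weight] for t in tokens]
```

Comparing ``output`` with ``reference`` gives a feel for the rounding error introduced by the 3-bit mantissa. Each quantized value is off by at most about 6% in relative terms, so the error of each product stays below roughly 13%.
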

.. code-block:: python

    from vllm import LLM

    # Load the quantized checkpoint from its local directory
    model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
    result = model.generate("Hello, my name is")

Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):

.. note::

    Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.

For the best inference performance, you can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by setting ``activation_scheme="static"``.
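
To make the idea of a per-tensor static scale concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not AutoFP8's implementation: the helper names and the max-absolute-value calibration rule are hypothetical, and FP8 is assumed to be the E4M3 format (maximum magnitude 448). A single scale is derived once from calibration batches and then reused for every later input, so no scale has to be computed at inference time.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format (assumed format)


def round_to_fp8_e4m3(x):
    """Round-to-nearest onto the FP8 E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    exponent = max(math.frexp(mag)[1] - 1, -6)  # floor(log2(mag)), clamped to min normal
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def calibrate_per_tensor_scale(batches):
    """One scale for the whole tensor: max |value| over all calibration batches."""
    amax = max(abs(v) for batch in batches for row in batch for v in row)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize_static(matrix, scale):
    """Quantize with a fixed, precomputed scale (nothing is measured at runtime)."""
    return [[round_to_fp8_e4m3(v / scale) for v in row] for row in matrix]


# Hypothetical calibration activations collected offline.
calibration_batches = [
    [[0.2, -1.5, 0.7], [3.0, 0.1, -0.4]],
    [[-2.2, 0.9, 1.1]],
]
act_scale = calibrate_per_tensor_scale(calibration_batches)

# A later input reuses the same scale, so values outside the calibrated range saturate.
new_input = [[1.0, -6.0, 0.5]]
q_input = quantize_static(new_input, act_scale)
dequantized = [[v * act_scale for v in row] for row in q_input]
```

In this sketch the value ``-6.0`` lies outside the calibrated range and is clipped to about ``-3.0`` after dequantization. This is the trade-off of static scales: they remove runtime work, but the calibration data needs to be representative of real inputs.
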