# Online Quantization Online quantization lets you take a BF16/FP16 model and quantize its Linear and MoE weights to lower precision (such as FP8) at load time, without needing a pre-quantized checkpoint or calibration data. Weights are converted during model loading and activations are dynamically scaled during each forward pass. ## Quick Start Pass a scheme name to the `quantization` parameter: ```python from vllm import LLM # Per-tensor FP8 quantization (one scale per weight tensor) llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_tensor") # Per-block FP8 quantization (128x128 block scaling for weights and 1x128 block scaling for activations) llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_block") ``` Or with the CLI: ```bash vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_tensor vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_block ``` ## Supported Schemes | Scheme | Weight recipe | Activation recipe | Notes | | ------ | ------------- | ------------------ | ----- | | `fp8_per_tensor` | fp8_e4m3 data, fp32 per-tensor scale | fp8_e4m3 data, fp32 per-tensor scale | On some GPUs (Ada, Hopper) linear activations use per-token scaling for better performance | | `fp8_per_block` | fp8_e4m3 data, fp32 per-128x128-block scale | fp8_e4m3 data, fp32 per-1x128-block scale | | Support for additional schemes will be added in future versions of vllm. ## Advanced Configuration For fine-grained control, use a `quantization_config` dictionary. ### Separate Schemes for Dense and MoE Layers You can apply different quantization schemes to dense linear layers and MoE expert layers: ```python from vllm import LLM llm = LLM( "ibm-granite/granite-3.0-1b-a400m-base", quantization="fp8_per_tensor", quantization_config={ "linear_scheme_override": "fp8_per_block", }, ) ``` Or, ```python from vllm import LLM llm = LLM( "ibm-granite/granite-3.0-1b-a400m-base", quantization="fp8_per_tensor", quantization_config={ "moe_scheme_override": "fp8_per_block", }, ) ``` ### Excluding Layers from Quantization Use the `ignore` parameter to skip specific layers. It accepts exact layer names and regex patterns (prefixed with `re:`): ```python from vllm import LLM llm = LLM( "ibm-granite/granite-3.0-1b-a400m-base", quantization="fp8_per_tensor", quantization_config={ "ignore": [ # exact layer name "model.layers.1.self_attn.o_proj", # regex: skip all QKV projections "re:.*[qkv]_proj", ], }, ) ``` !!! note For fused layers (e.g., `qkv_proj` which fuses `q_proj`, `k_proj`, `v_proj`), the ignore pattern must match the **unfused** shard names (`q_proj`, `k_proj`, `v_proj`), not the fused name.