[Doc] add docs for online quant frontend (#39736)

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>

Commit 5e5afafa authored Apr 16, 2026 by Vasiliy Kuznetsov, committed by GitHub Apr 16, 2026

Showing with 95 additions and 0 deletions

docs/features/quantization/README.md +1 -0

docs/features/quantization/online.md +94 -0

--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -16,6 +16,7 @@ The following are the supported quantization formats for vLLM:
 - [INT8 W8A8](int8.md)
 - [FP8 W8A8](fp8.md)
 - [NVIDIA Model Optimizer](modelopt.md)
+- [Online Quantization](online.md)
 - [AMD Quark](quark.md)
 - [Quantized KV Cache](quantized_kvcache.md)
 - [TorchAO](torchao.md)

--- a/docs/features/quantization/online.md
+++ b/docs/features/quantization/online.md
+# Online Quantization
+Online quantization lets you take a BF16/FP16 model and quantize its Linear
+and MoE weights to lower precision (such as FP8) at load time, without needing
+a pre-quantized checkpoint or calibration data. Weights are converted during
+model loading and activations are dynamically scaled during each forward pass.
+## Quick Start
+Pass a scheme name to the `quantization` parameter:
+```python
+from vllm import LLM
+# Per-tensor FP8 quantization (one scale per weight tensor)
+llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_tensor")
+# Per-block FP8 quantization (128x128 block scaling for weights and 1x128 block scaling for activations)
+llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_block")
+```
+Or with the CLI:
+```bash
+vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_tensor
+vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_block
+```
+## Supported Schemes
+| Scheme | Weight recipe | Activation recipe | Notes |
+| ------ | ------------- | ------------------ | ----- |
+| `fp8_per_tensor` | fp8_e4m3 data, fp32 per-tensor scale | fp8_e4m3 data, fp32 per-tensor scale | On some GPUs (Ada, Hopper), linear-layer activations use per-token scaling for better performance |
+| `fp8_per_block` | fp8_e4m3 data, fp32 per-128x128-block scale | fp8_e4m3 data, fp32 per-1x128-block scale | |
+Support for additional schemes will be added in future versions of vLLM.
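
Reviewer note: to make the difference between the two weight recipes in the table concrete, here is a minimal pure-Python sketch of how many scales each one produces. It is not vLLM's implementation. It assumes the common absmax convention (`scale = amax / 448`, where 448 is the largest finite value of the float8 e4m3fn format); the helper names `per_tensor_scale` and `per_block_scales` are hypothetical, and the actual kernels may clamp or round scales differently.

```python
# Illustrative only: scale computation in plain Python, not vLLM's kernels.
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def per_tensor_scale(weight):
    """Return one scale for the whole 2-D weight (a list of rows)."""
    amax = max(abs(x) for row in weight for x in row)
    return amax / FP8_E4M3_MAX


def per_block_scales(weight, block_rows=128, block_cols=128):
    """Return one scale per (block_rows x block_cols) tile of the weight."""
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_scales = []
        for c0 in range(0, n_cols, block_cols):
            # Absolute maximum over the current tile only.
            amax = max(
                abs(x)
                for row in weight[r0:r0 + block_rows]
                for x in row[c0:c0 + block_cols]
            )
            row_scales.append(amax / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


# A 256x384 weight whose entries grow with the column index (max value 1.0).
weight = [[(c + 1) / 384.0 for c in range(384)] for _ in range(256)]

tensor_scale = per_tensor_scale(weight)
block_scales = per_block_scales(weight)

print(tensor_scale)                             # a single float
print(len(block_scales), len(block_scales[0]))  # prints "2 3"
```

In this sketch the per-block recipe gives the low-magnitude left-hand tiles their own smaller scales, while the per-tensor recipe forces every element to share the scale set by the largest entry.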
+## Advanced Configuration
+For fine-grained control, use a `quantization_config` dictionary.
+### Separate Schemes for Dense and MoE Layers
+You can apply different quantization schemes to dense linear layers and MoE expert layers:
+```python
+from vllm import LLM
+llm = LLM(
+    "ibm-granite/granite-3.0-1b-a400m-base",
+    quantization="fp8_per_tensor",
+    quantization_config={
+        "linear_scheme_override": "fp8_per_block",
+    },
+)
+```
+Or, using `moe_scheme_override` instead:
+```python
+from vllm import LLM
+llm = LLM(
+    "ibm-granite/granite-3.0-1b-a400m-base",
+    quantization="fp8_per_tensor",
+    quantization_config={
+        "moe_scheme_override": "fp8_per_block",
+    },
+)
+```
+### Excluding Layers from Quantization
+Use the `ignore` parameter to skip specific layers. It accepts exact layer names and regex patterns (prefixed with `re:`):
+```python
+from vllm import LLM
+llm = LLM(
+    "ibm-granite/granite-3.0-1b-a400m-base",
+    quantization="fp8_per_tensor",
+    quantization_config={
+        "ignore": [
+            # exact layer name
+            "model.layers.1.self_attn.o_proj",
+            # regex: skip all QKV projections
+            "re:.*[qkv]_proj",
+        ],
+    },
+)
+```
+!!! note
+    For fused layers (e.g., `qkv_proj` which fuses `q_proj`, `k_proj`, `v_proj`), the ignore pattern must match the **unfused** shard names (`q_proj`, `k_proj`, `v_proj`), not the fused name.
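
Reviewer note: the matching rules for `ignore` described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not vLLM's code: the `FUSED_SHARDS` table and the helpers `matches_ignore` and `is_ignored` are hypothetical, regex patterns are assumed to be applied with `re.match`, and a fused layer is assumed to be skipped only when all of its unfused shard names match.

```python
import re

# Hypothetical mapping from a fused layer suffix to its unfused shard names.
FUSED_SHARDS = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}


def matches_ignore(name, ignore):
    """True if `name` equals an exact entry or matches a `re:` pattern."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Assumption: the regex is anchored at the start, like re.match.
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


def is_ignored(layer_name, ignore):
    """Decide whether a (possibly fused) layer is excluded from quantization."""
    prefix, _, suffix = layer_name.rpartition(".")
    shards = FUSED_SHARDS.get(suffix)
    if shards is None:
        return matches_ignore(layer_name, ignore)
    # Fused layer: test the unfused shard names, never the fused name itself.
    base = prefix + "." if prefix else ""
    return all(matches_ignore(base + shard, ignore) for shard in shards)


ignore = [
    "model.layers.1.self_attn.o_proj",  # exact layer name
    "re:.*[qkv]_proj",                  # regex: all Q/K/V projections
]

print(is_ignored("model.layers.1.self_attn.o_proj", ignore))
print(is_ignored("model.layers.0.self_attn.qkv_proj", ignore))
# Listing only the fused name does not match any unfused shard in this sketch.
print(is_ignored("model.layers.0.self_attn.qkv_proj",
                 ["model.layers.0.self_attn.qkv_proj"]))
```

Under these assumptions, an exact entry for the fused `qkv_proj` name has no effect, which mirrors the behavior the note warns about.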