Commit 5e5afafa authored by Vasiliy Kuznetsov, committed by GitHub

[Doc] add docs for online quant frontend (#39736)


Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
parent 324a3d2b
@@ -16,6 +16,7 @@ The following are the supported quantization formats for vLLM:
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
- [NVIDIA Model Optimizer](modelopt.md)
- [Online Quantization](online.md)
- [AMD Quark](quark.md)
- [Quantized KV Cache](quantized_kvcache.md)
- [TorchAO](torchao.md)
......
# Online Quantization
Online quantization lets you take a BF16/FP16 model and quantize its Linear
and MoE weights to lower precision (such as FP8) at load time, without needing
a pre-quantized checkpoint or calibration data. Weights are converted during
model loading and activations are dynamically scaled during each forward pass.
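To make the idea of a dynamically computed scale concrete, here is a minimal pure-Python sketch of per-tensor scaling. It is an illustration only, not vLLM's implementation: the helper names `per_tensor_scale` and `scale_and_clamp` are hypothetical, the convention `scale = amax / 448` is a common choice that is assumed here, and the final rounding to FP8 is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def per_tensor_scale(values):
    """Return one scale that maps the largest magnitude onto the FP8 maximum."""
    amax = max(abs(v) for v in values)
    # Guard against an all-zero tensor so the later division is safe.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def scale_and_clamp(values, scale):
    """Divide by the scale and clamp to the FP8 range (rounding to FP8 omitted)."""
    return [min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in values]


weights = [0.5, -2.0, 1.0]
w_scale = per_tensor_scale(weights)
w_scaled = scale_and_clamp(weights, w_scale)
print(w_scale, w_scaled)
```

For weights, a scale like this would be computed once at load time; for activations, it would be recomputed from the live values on each forward pass.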
## Quick Start
Pass a scheme name to the `quantization` parameter:
```python
from vllm import LLM
# Per-tensor FP8 quantization (one scale per weight tensor)
llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_tensor")
# Per-block FP8 quantization (128x128 block scaling for weights and 1x128 block scaling for activations)
llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_block")
```
Or with the CLI:
```bash
vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_tensor
vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_block
```
## Supported Schemes
| Scheme | Weight recipe | Activation recipe | Notes |
| ------ | ------------- | ------------------ | ----- |
| `fp8_per_tensor` | fp8_e4m3 data, fp32 per-tensor scale | fp8_e4m3 data, fp32 per-tensor scale | On some GPUs (Ada, Hopper) linear activations use per-token scaling for better performance |
| `fp8_per_block` | fp8_e4m3 data, fp32 per-128x128-block scale | fp8_e4m3 data, fp32 per-1x128-block scale | |
Support for additional schemes will be added in future versions of vLLM.
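The granularities in the table above translate into very different numbers of scale values. The following sketch counts them for a given tensor shape. The helper names are hypothetical, and it assumes that partial edge blocks each receive their own scale (hence the ceiling division); vLLM's actual handling of shapes that are not multiples of 128 may differ. It also ignores the per-token activation variant mentioned in the Notes column.

```python
import math

BLOCK = 128  # block edge used by fp8_per_block, per the table above


def weight_scale_count(rows, cols, scheme):
    """Number of weight scales for a (rows x cols) weight matrix."""
    if scheme == "fp8_per_tensor":
        return 1
    if scheme == "fp8_per_block":
        # One scale per 128x128 block; edge blocks are assumed to count too.
        return math.ceil(rows / BLOCK) * math.ceil(cols / BLOCK)
    raise ValueError(f"unknown scheme: {scheme}")


def activation_scale_count(tokens, hidden, scheme):
    """Number of activation scales for a (tokens x hidden) activation."""
    if scheme == "fp8_per_tensor":
        return 1
    if scheme == "fp8_per_block":
        # One scale per 1x128 block along the hidden dimension of each token.
        return tokens * math.ceil(hidden / BLOCK)
    raise ValueError(f"unknown scheme: {scheme}")


print(weight_scale_count(4096, 4096, "fp8_per_block"))
print(activation_scale_count(8, 4096, "fp8_per_block"))
```

Under these assumptions, a 4096x4096 weight has 32 x 32 blocks and so carries many more scales than the single per-tensor scale, which is the trade-off between accuracy and metadata overhead that the two schemes represent.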
## Advanced Configuration
For fine-grained control, use a `quantization_config` dictionary.
### Separate Schemes for Dense and MoE Layers
You can apply different quantization schemes to dense linear layers and MoE expert layers:
```python
from vllm import LLM
llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    # Scheme applied by default
    quantization="fp8_per_tensor",
    quantization_config={
        # Dense linear layers use per-block scaling instead
        "linear_scheme_override": "fp8_per_block",
    },
)
```
Or, to override only the MoE expert layers:
```python
from vllm import LLM
llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    # Scheme applied by default
    quantization="fp8_per_tensor",
    quantization_config={
        # MoE expert layers use per-block scaling instead
        "moe_scheme_override": "fp8_per_block",
    },
)
```
### Excluding Layers from Quantization
Use the `ignore` parameter to skip specific layers. It accepts exact layer names and regex patterns (prefixed with `re:`):
```python
from vllm import LLM
llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "ignore": [
            # exact layer name
            "model.layers.1.self_attn.o_proj",
            # regex: skip all QKV projections
            "re:.*[qkv]_proj",
        ],
    },
)
```
!!! note
    For fused layers (e.g., `qkv_proj` which fuses `q_proj`, `k_proj`, `v_proj`), the ignore pattern must match the **unfused** shard names (`q_proj`, `k_proj`, `v_proj`), not the fused name.
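
The matching rules for `ignore` can be sketched as follows. This is a hypothetical helper, `is_ignored`, written only to illustrate the behavior described above; it is not vLLM's code. It assumes that `re:` patterns are applied with `re.match` (anchored at the start of the name) and that names are checked in their unfused form.

```python
import re


def is_ignored(layer_name, ignore):
    """Return True if an unfused layer name matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Entries prefixed with "re:" are treated as regular expressions.
            if re.match(entry[3:], layer_name):
                return True
        elif entry == layer_name:
            # All other entries must equal the layer name exactly.
            return True
    return False


ignore = ["model.layers.1.self_attn.o_proj", "re:.*[qkv]_proj"]

for name in [
    "model.layers.1.self_attn.o_proj",  # exact match
    "model.layers.0.self_attn.o_proj",  # different layer index, not matched
    "model.layers.0.self_attn.k_proj",  # matched by the regex
    "model.layers.0.mlp.down_proj",     # not matched
]:
    print(name, is_ignored(name, ignore))

# An exact entry that names the fused layer does not match the unfused shards.
print(is_ignored("model.layers.0.self_attn.q_proj",
                 ["model.layers.0.self_attn.qkv_proj"]))
```

The last call shows why the note above matters: under these assumptions, listing only the fused name leaves the individual `q_proj`, `k_proj`, and `v_proj` shards unmatched.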