Support for additional schemes will be added in future versions of vLLM.
## Advanced Configuration
For fine-grained control, pass a `quantization_config` dictionary alongside the `quantization` argument.
### Separate Schemes for Dense and MoE Layers
You can apply different quantization schemes to dense linear layers and MoE expert layers:
```python
from vllm import LLM

llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    # Base scheme for the model
    quantization="fp8_per_tensor",
    quantization_config={
        # Dense linear layers use this scheme instead of the base scheme
        "linear_scheme_override": "fp8_per_block",
    },
)
```
Or, to override the scheme for the MoE expert layers instead:
```python
from vllm import LLM

llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    # Base scheme for the model
    quantization="fp8_per_tensor",
    quantization_config={
        # MoE expert layers use this scheme instead of the base scheme
        "moe_scheme_override": "fp8_per_block",
    },
)
```
### Excluding Layers from Quantization
Use the `ignore` parameter to skip specific layers. It accepts exact layer names and regex patterns (prefixed with `re:`):
```python
from vllm import LLM

llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "ignore": [
            # exact layer name
            "model.layers.1.self_attn.o_proj",
            # regex: skip all QKV projections
            "re:.*[qkv]_proj",
        ],
    },
)
```
!!! note
    For fused layers (e.g., `qkv_proj`, which fuses `q_proj`, `k_proj`, and `v_proj`), the ignore pattern must match the **unfused** shard names (`q_proj`, `k_proj`, `v_proj`), not the fused name.
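
To make the matching rules concrete, here is a minimal standalone sketch of how `ignore` entries could be evaluated. It is not vLLM's implementation: the helpers `matches` and `is_ignored` and the `FUSED_SHARDS` table are hypothetical, and it assumes that `re:` patterns are anchored at the start of the name (as with `re.match`) and that a fused layer is skipped only when every one of its shards is ignored.

```python
import re

# Hypothetical mapping from a fused layer name to its unfused shard names.
FUSED_SHARDS = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}


def matches(name: str, pattern: str) -> bool:
    """Return True if `name` matches a single ignore entry."""
    if pattern.startswith("re:"):
        # Assumption: regex entries are anchored at the start, like re.match.
        return re.match(pattern[len("re:"):], name) is not None
    # Entries without the prefix are compared as exact layer names.
    return name == pattern


def is_ignored(layer_name: str, ignore: list[str]) -> bool:
    """Check a layer name, expanding a fused name to its shard names first."""
    prefix, _, leaf = layer_name.rpartition(".")
    shards = FUSED_SHARDS.get(leaf, [leaf])
    candidates = [f"{prefix}.{s}" if prefix else s for s in shards]
    # Assumption: a fused layer is skipped only if every shard is ignored.
    return all(any(matches(c, p) for p in ignore) for c in candidates)


ignore = ["model.layers.1.self_attn.o_proj", "re:.*[qkv]_proj"]
print(is_ignored("model.layers.1.self_attn.o_proj", ignore))    # True
print(is_ignored("model.layers.0.self_attn.qkv_proj", ignore))  # True
print(is_ignored("model.layers.0.self_attn.o_proj", ignore))    # False
# A pattern that names only the fused layer does not match any shard.
print(is_ignored("model.layers.0.self_attn.qkv_proj", ["re:.*qkv_proj"]))  # False
```

The last call illustrates the note above: under these assumptions, a pattern written against `qkv_proj` never matches `q_proj`, `k_proj`, or `v_proj`, so the fused layer would still be quantized.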