Commit 3fb4b5fa authored by zhuwenwen

Merge tag 'v0.18.0' into v0.18.0-ori

parents bcf25339 89138b21
...@@ -6,7 +6,7 @@ A Ray cluster can be declared in YAML, and the operator then handles pod schedul ...@@ -6,7 +6,7 @@ A Ray cluster can be declared in YAML, and the operator then handles pod schedul
## Why KubeRay instead of manual scripts? ## Why KubeRay instead of manual scripts?
| Feature | Manual scripts | KubeRay | | Feature | Manual scripts | KubeRay |
|---------|-----------------------------------------------------------|---------| | ------- | --------------------------------------------------------- | ------- |
| Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` | | Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` |
| Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size | | Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size |
| Upgrades | Tear down & re-create manually | Blue/green deployment updates supported | | Upgrades | Tear down & re-create manually | Blue/green deployment updates supported |
......
...@@ -11,6 +11,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le ...@@ -11,6 +11,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le
Alternatively, you can deploy vLLM to Kubernetes using any of the following: Alternatively, you can deploy vLLM to Kubernetes using any of the following:
- [Helm](frameworks/helm.md) - [Helm](frameworks/helm.md)
- [NVIDIA Dynamo](integrations/dynamo.md)
- [InftyAI/llmaz](integrations/llmaz.md) - [InftyAI/llmaz](integrations/llmaz.md)
- [llm-d](integrations/llm-d.md) - [llm-d](integrations/llm-d.md)
- [KAITO](integrations/kaito.md) - [KAITO](integrations/kaito.md)
...@@ -20,7 +21,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: ...@@ -20,7 +21,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
- [kubernetes-sigs/lws](frameworks/lws.md) - [kubernetes-sigs/lws](frameworks/lws.md)
- [meta-llama/llama-stack](integrations/llamastack.md) - [meta-llama/llama-stack](integrations/llamastack.md)
- [substratusai/kubeai](integrations/kubeai.md) - [substratusai/kubeai](integrations/kubeai.md)
- [vllm-project/aibrix](https://github.com/vllm-project/aibrix) - [vllm-project/AIBrix](integrations/aibrix.md)
- [vllm-project/production-stack](integrations/production-stack.md) - [vllm-project/production-stack](integrations/production-stack.md)
## Deployment with CPUs ## Deployment with CPUs
......
...@@ -119,10 +119,10 @@ The code can be found in [vllm/v1/engine/coordinator.py](../../vllm/v1/engine/co ...@@ -119,10 +119,10 @@ The code can be found in [vllm/v1/engine/coordinator.py](../../vllm/v1/engine/co
For a deployment with `N` GPUs, `TP` tensor parallel size, `DP` data parallel size, and `A` API server count: For a deployment with `N` GPUs, `TP` tensor parallel size, `DP` data parallel size, and `A` API server count:
| Process Type | Count | Notes | | Process Type | Count | Notes |
|---|---|---| | - | - | - |
| API Server | `A` (default `DP`) | Handles HTTP requests and input processing | | API Server | `A` (default `DP`) | Handles HTTP requests and input processing |
| Engine Core | `DP` (default 1) | Scheduler and KV cache management | | Engine Core | `DP` (default 1) | Scheduler and KV cache management |
| GPU Worker | `N` (= `DP x TP`) | One per GPU, executes model forward passes | | GPU Worker | `N` (= `DP x PP x TP`) | One per GPU, executes model forward passes |
| DP Coordinator | 1 if `DP > 1`, else 0 | Load balancing across DP ranks | | DP Coordinator | 1 if `DP > 1`, else 0 | Load balancing across DP ranks |
| **Total** | **`A + DP + N` (+ 1 if DP > 1)** | | | **Total** | **`A + DP + N` (+ 1 if DP > 1)** | |
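The arithmetic in the process-count table can be checked with a small sketch. It follows the updated row (`N = DP x PP x TP`) and assumes `PP` means pipeline parallel size, which the intro sentence does not define. The function name and dictionary keys are hypothetical, not part of vLLM.

```python
from typing import Dict, Optional


def process_counts(
    tp: int, dp: int = 1, pp: int = 1, api_servers: Optional[int] = None
) -> Dict[str, int]:
    """Count processes per type, following the table (hypothetical helper)."""
    if min(tp, dp, pp) < 1:
        raise ValueError("tp, dp and pp must all be >= 1")
    # A defaults to DP when no API server count is given.
    a = dp if api_servers is None else api_servers
    # One GPU worker per GPU: N = DP x PP x TP (PP assumed: pipeline parallel size).
    n = dp * pp * tp
    # The DP coordinator only exists when DP > 1.
    coordinator = 1 if dp > 1 else 0
    return {
        "api_server": a,
        "engine_core": dp,
        "gpu_worker": n,
        "dp_coordinator": coordinator,
        "total": a + dp + n + coordinator,
    }


counts = process_counts(tp=2, dp=4)
print(counts["total"])  # 4 API servers + 4 engine cores + 8 workers + 1 coordinator
```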
...@@ -208,9 +208,7 @@ configurations affect the class we ultimately get. ...@@ -208,9 +208,7 @@ configurations affect the class we ultimately get.
The following figure shows the class hierarchy of vLLM: The following figure shows the class hierarchy of vLLM:
> <figure markdown="span"> ![Class Hierarchy](../assets/design/hierarchy.png)
> ![](../assets/design/hierarchy.png){ align="center" alt="query" width="100%" }
> </figure>
There are several important design choices behind this class hierarchy: There are several important design choices behind this class hierarchy:
......
...@@ -101,7 +101,7 @@ Priority is **1 = highest** (tried first). ...@@ -101,7 +101,7 @@ Priority is **1 = highest** (tried first).
**Blackwell (SM 10.x):** **Blackwell (SM 10.x):**
| Priority | Backend | | Priority | Backend |
|----------|---------| | -------- | ------- |
| 1 | `FLASHINFER` | | 1 | `FLASHINFER` |
| 2 | `FLASH_ATTN` | | 2 | `FLASH_ATTN` |
| 3 | `TRITON_ATTN` | | 3 | `TRITON_ATTN` |
...@@ -110,7 +110,7 @@ Priority is **1 = highest** (tried first). ...@@ -110,7 +110,7 @@ Priority is **1 = highest** (tried first).
**Ampere/Hopper (SM 8.x-9.x):** **Ampere/Hopper (SM 8.x-9.x):**
| Priority | Backend | | Priority | Backend |
|----------|---------| | -------- | ------- |
| 1 | `FLASH_ATTN` | | 1 | `FLASH_ATTN` |
| 2 | `FLASHINFER` | | 2 | `FLASHINFER` |
| 3 | `TRITON_ATTN` | | 3 | `TRITON_ATTN` |
...@@ -121,18 +121,19 @@ Priority is **1 = highest** (tried first). ...@@ -121,18 +121,19 @@ Priority is **1 = highest** (tried first).
**Blackwell (SM 10.x):** **Blackwell (SM 10.x):**
| Priority | Backend | | Priority | Backend |
|----------|---------| | -------- | ------- |
| 1 | `FLASHINFER_MLA` | | 1 | `FLASHINFER_MLA` |
| 2 | `CUTLASS_MLA` | | 2 | `CUTLASS_MLA` |
| 3 | `FLASH_ATTN_MLA` | | 3 | `FLASH_ATTN_MLA` |
| 4 | `FLASHMLA` | | 4 | `FLASHMLA` |
| 5 | `TRITON_MLA` | | 5 | `TRITON_MLA` |
| 6 | `FLASHMLA_SPARSE` | | 6 | `FLASHMLA_SPARSE` |
| 7 | `FLASHINFER_MLA_SPARSE` |
**Ampere/Hopper (SM 8.x-9.x):** **Ampere/Hopper (SM 8.x-9.x):**
| Priority | Backend | | Priority | Backend |
|----------|---------| | -------- | ------- |
| 1 | `FLASH_ATTN_MLA` | | 1 | `FLASH_ATTN_MLA` |
| 2 | `FLASHMLA` | | 2 | `FLASHMLA` |
| 3 | `FLASHINFER_MLA` | | 3 | `FLASHINFER_MLA` |
...@@ -144,7 +145,7 @@ Priority is **1 = highest** (tried first). ...@@ -144,7 +145,7 @@ Priority is **1 = highest** (tried first).
## Legend ## Legend
| Column | Description | | Column | Description |
|--------|-------------| | ------ | ----------- |
| **Dtypes** | Supported model data types (fp16, bf16, fp32) | | **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) | | **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) | | **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
...@@ -152,6 +153,7 @@ Priority is **1 = highest** (tried first). ...@@ -152,6 +153,7 @@ Priority is **1 = highest** (tried first).
| **Sink** | Attention sink support (for StreamingLLM) | | **Sink** | Attention sink support (for StreamingLLM) |
| **Sparse** | Sparse attention support (MLA only) | | **Sparse** | Sparse attention support (MLA only) |
| **MM Prefix** | Multimodal prefix full attention support | | **MM Prefix** | Multimodal prefix full attention support |
| **DCP** | Decode Context Parallelism support (`--decode-context-parallel-size`) |
| **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) | | **Attention Types** | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
| **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) | | **Compute Cap.** | Required CUDA compute capability (N/A for non-CUDA backends) |
...@@ -159,24 +161,25 @@ Priority is **1 = highest** (tried first). ...@@ -159,24 +161,25 @@ Priority is **1 = highest** (tried first).
## Standard Attention (MHA, MQA, GQA) Backends ## Standard Attention (MHA, MQA, GQA) Backends
| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | Attention Types | Compute Cap. | | Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|---------|--------|-----------|-------------|------------|------|-----------|-----------------|--------------| | ------- | ------- | ------ | --------- | ----------- | ---------- | ---- | --------- | --- | --------------- | ------------ |
| `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | All | N/A | | `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
| `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | Decoder | 7.x-9.x | | `FLASHINFER` | Native† | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
| `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | Decoder | 10.x | | `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
| `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | All | ≥8.0 | | `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `float16`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
| `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | All | 9.x | | `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
| `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | Decoder | Any | | `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `float16`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 |
| `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | Decoder, Encoder Only | Any | | `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| `ROCM_AITER_FA` | | fp16, bf16 | `auto` | 16, 32 | 64, 128, 256 | ❌ | ❌ | Decoder | N/A | | `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `float16`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | Decoder | N/A | | `ROCM_AITER_FA` | | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder, Enc-Dec | N/A |
| `ROCM_ATTN` | | fp16, bf16, fp32 | `auto` | 16, 32, 544 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | Decoder | N/A | | `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | %16 | Any | ✅ | ✅ | ❌ | All | N/A |
| `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | Decoder | Any | | `ROCM_ATTN` | | fp16, bf16, fp32 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ✅ | ✅ | ❌ | All | N/A |
| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | All | Any | | `TREE_ATTN` | | fp16, bf16 | `auto`, `float16`, `bfloat16` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
| `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |
> **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`. > **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
> >
> **\*** Specify the FlashAttention version via `--attention-config.flash_attn_version=2` or `3`. Default is FA3 on SM90, FA2 otherwise. > **\*** Specify the FlashAttention version via `--attention-config.flash_attn_version=2`, `3`, or `4`. Default is FA4 on SM100+ (Blackwell), FA3 on SM90 (Hopper), FA2 otherwise.
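The default described in the updated footnote can be written as a small decision function. This is only a reading of the footnote text (FA4 on SM100+, FA3 on SM90, FA2 otherwise), not vLLM's actual selection code, and the function name is hypothetical.

```python
def default_flash_attn_version(major: int, minor: int = 0) -> int:
    """Map a CUDA compute capability to the default FA version per the footnote."""
    sm = major * 10 + minor
    if sm >= 100:
        return 4  # "FA4 on SM100+ (Blackwell)"
    if major == 9:
        return 3  # "FA3 on SM90 (Hopper)"
    return 2  # "FA2 otherwise"


for cap in [(8, 0), (9, 0), (10, 0)]:
    print(cap, "-> FA", default_flash_attn_version(*cap))
```

An explicit `--attention-config.flash_attn_version` value would take precedence over this default.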
## MLA (Multi-head Latent Attention) Backends ## MLA (Multi-head Latent Attention) Backends
...@@ -188,10 +191,10 @@ The prefill backend is selected at runtime based on hardware and ...@@ -188,10 +191,10 @@ The prefill backend is selected at runtime based on hardware and
configuration. configuration.
| Backend | Description | Compute Cap. | Enable | Disable | Notes | | Backend | Description | Compute Cap. | Enable | Disable | Notes |
|---------|-------------|--------------|--------|---------|-------| | ------- | ----------- | ------------ | ------ | ------- | ----- |
| TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only | | TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
| FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only | | FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
| cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | | | cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
| FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise | | FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |
> **‡** TRT-LLM Ragged is the default on Blackwell (SM100). > **‡** TRT-LLM Ragged is the default on Blackwell (SM100).
...@@ -199,14 +202,16 @@ configuration. ...@@ -199,14 +202,16 @@ configuration.
### Decode Backends ### Decode Backends
| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | Attention Types | Compute Cap. | | Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
|---------|--------|-----------|-------------|------------|------|--------|-----------|-----------------|--------------| | ------- | ------ | --------- | ----------- | ---------- | ---- | ------ | --------- | --- | --------------- | ------------ |
| `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | Decoder | 10.x | | `CUTLASS_MLA` | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
| `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | Decoder | 10.x | | `FLASHINFER_MLA` | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
| `FLASHMLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | Decoder | 9.x-10.x | | `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |
| `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | Decoder | 9.x-10.x | | `FLASHMLA` | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3` | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
| `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | Decoder | 9.x | | `FLASHMLA_SPARSE` | bf16 | `auto`, `bfloat16`, `fp8_ds_mla` | 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
| `ROCM_AITER_MLA` | fp16, bf16 | `auto` | 1 | Any | ❌ | ❌ | ❌ | Decoder | N/A | | `FLASH_ATTN_MLA` | fp16, bf16 | `auto`, `float16`, `bfloat16` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
| `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto` | Any | 576 | ❌ | ❌ | ❌ | Decoder | N/A | | `ROCM_AITER_MLA` | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | Decoder | N/A | | `ROCM_AITER_MLA_SPARSE` | fp16, bf16 | `auto`, `float16`, `bfloat16` | 1 | Any | ❌ | ✅ | ❌ | ❌ | Decoder | N/A |
| `TRITON_MLA` | fp16, bf16 | `auto`, `bfloat16` | Any | Any | ❌ | ❌ | ❌ | Decoder | Any | | `ROCM_AITER_TRITON_MLA` | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| `TRITON_MLA` | fp16, bf16 | `auto`, `float16`, `bfloat16`, `fp8`, `fp8_e4m3` | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |
| `XPU_MLA_SPARSE` | fp16, bf16 | `auto`, `float16`, `bfloat16` | Any | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | Any |
...@@ -98,7 +98,7 @@ The goal of this structure is to uniquely identify a (padded) batch with minimal ...@@ -98,7 +98,7 @@ The goal of this structure is to uniquely identify a (padded) batch with minimal
### `CudagraphDispatcher` ### `CudagraphDispatcher`
The [CudagraphDispatcher][vllm.v1.cudagraph_dispatcher.CudagraphDispatcher] takes responsibility for maintaining two sets of valid dispatching keys, one set for `FULL` runtime mode and one set for `PIECEWISE` runtime mode, and dispatches the correct runtime mode and the dispatching keys before executing the model's forwards. It will take in the initial key (a rough batch_descriptor for the padded input) and return the selected runtime mode and the final batch_descriptor, then tell the CUDAGraphWarpper instances that decision through forward contexts. Notice that `CudagraphDispatcher` is the only source of truth for available CUDA Graph keys and `CUDAGraphWrapper` instances can blindly trust the forward context on what CUDA Graphs to dispatch to. This lets us simplify the wrapper code and centralize the logic in the dispatcher. The [CudagraphDispatcher][vllm.v1.cudagraph_dispatcher.CudagraphDispatcher] takes responsibility for maintaining two sets of valid dispatching keys, one set for `FULL` runtime mode and one set for `PIECEWISE` runtime mode, and dispatches the correct runtime mode and the dispatching keys before executing the model's forwards. It will take in the initial key (a rough batch_descriptor for the padded input) and return the selected runtime mode and the final batch_descriptor, then tell the CUDAGraphWrapper instances that decision through forward contexts. Notice that `CudagraphDispatcher` is the only source of truth for available CUDA Graph keys and `CUDAGraphWrapper` instances can blindly trust the forward context on what CUDA Graphs to dispatch to. This lets us simplify the wrapper code and centralize the logic in the dispatcher.
The dispatching keys are initialized through the dispatcher's `initialize_cudagraph_keys` method, which is called by the gpu_model_runner after all possible attention backends are initialized. This is where we can get much fancier in the future and “prepare” all kinds of CUDA Graphs combinations. For now, we just append available keys based on the valid combos of `decode_mode`/`mixed_mode` of `cudagraph_mode` and `cudagraph_capture_sizes` in the compilation config. The dispatching keys are initialized through the dispatcher's `initialize_cudagraph_keys` method, which is called by the gpu_model_runner after all possible attention backends are initialized. This is where we can get much fancier in the future and “prepare” all kinds of CUDA Graphs combinations. For now, we just append available keys based on the valid combos of `decode_mode`/`mixed_mode` of `cudagraph_mode` and `cudagraph_capture_sizes` in the compilation config.
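A toy model of the dispatch idea described above: two key sets, one per runtime mode, and a lookup that returns the chosen mode. The class, its method names, the use of plain padded token counts as keys, and the `FULL`-before-`PIECEWISE` lookup order are all simplifying assumptions for illustration; the real `CudagraphDispatcher` works on batch descriptors and has more logic.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class ToyCudagraphDispatcher:
    """Hypothetical stand-in: keeps one set of valid keys per runtime mode."""

    def __init__(self) -> None:
        self.keys: Dict[str, Set[int]] = {"FULL": set(), "PIECEWISE": set()}

    def initialize_keys(
        self, full_sizes: Iterable[int], piecewise_sizes: Iterable[int]
    ) -> None:
        # In this sketch a key is just a padded token count.
        self.keys["FULL"].update(full_sizes)
        self.keys["PIECEWISE"].update(piecewise_sizes)

    def dispatch(self, num_tokens: int) -> Tuple[str, Optional[int]]:
        # Assumed order: try FULL first, then PIECEWISE, else no CUDA Graph.
        for mode in ("FULL", "PIECEWISE"):
            if num_tokens in self.keys[mode]:
                return mode, num_tokens
        return "NONE", None


dispatcher = ToyCudagraphDispatcher()
dispatcher.initialize_keys(full_sizes=[1, 2, 4], piecewise_sizes=[8, 16])
print(dispatcher.dispatch(4), dispatcher.dispatch(16), dispatcher.dispatch(3))
```

Because the dispatcher owns both key sets, a wrapper only needs to read the returned mode and key, which is the centralization the paragraph above describes.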
...@@ -174,17 +174,18 @@ Suppose we have hybrid attention backends (e.g., in mamba mixer models). In that ...@@ -174,17 +174,18 @@ Suppose we have hybrid attention backends (e.g., in mamba mixer models). In that
The following table lists backends that support full CUDA Graphs at the time of writing. The following table lists backends that support full CUDA Graphs at the time of writing.
| Attention Backend | cudagraph_support | Comments | | Attention Backend | cudagraph_support | Comments |
|:---|:---|:---| | :---------------- | :---------------- | :------- |
| FlashAttention v2 | `UNIFORM_BATCH` | Actually `ALWAYS` but workaround to fallback to `FULL_AND_PIECEWISE` for performance reason | | FlashAttention v2 | `UNIFORM_BATCH` | Actually `ALWAYS` but workaround to fallback to `FULL_AND_PIECEWISE` for performance reason |
| FlashAttention v3 | `ALWAYS` | has unified routine for both batches, so `FULL` mode is good | | FlashAttention v3 | `ALWAYS` | has unified routine for both batches, so `FULL` mode is good |
| Triton Attention | `ALWAYS` | prefer `FULL_AND_PIECEWISE` since it has different kernels for prefill/mixed and pure decode batches | | Triton Attention | `ALWAYS` | prefer `FULL_AND_PIECEWISE` since it has different kernels for prefill/mixed and pure decode batches |
| AITER FlashAttention | `UNIFORM_BATCH`| | | AITER FlashAttention | `UNIFORM_BATCH` | |
| FlashInfer | `UNIFORM_SINGLE_TOKEN_DECODE` | Will be set to `UNIFORM_BATCH` when using TRTLLM attention on Blackwell | | FlashInfer | `UNIFORM_SINGLE_TOKEN_DECODE` | Will be set to `UNIFORM_BATCH` when using TRTLLM attention on Blackwell |
| FlashMLA | `UNIFORM_BATCH` | | | FlashMLA | `UNIFORM_BATCH` | |
| FlashInferMLA | `UNIFORM_BATCH` | | | FlashInferMLA | `UNIFORM_BATCH` | |
| FlashInferMLASparse | `UNIFORM_BATCH` | |
| AITER MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | | | AITER MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
| CUTLASS MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | | | CUTLASS MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
| Mamba attention| `UNIFORM_SINGLE_TOKEN_DECODE` | | | Mamba attention | `UNIFORM_SINGLE_TOKEN_DECODE` | |
Unlisted backends are all declared as `NEVER`. Unlisted backends are all declared as `NEVER`.
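The table plus the `NEVER` default amounts to a lookup with a fallback. The sketch below copies the rows from the updated side of the table into a plain dictionary; the dictionary, the string keys, and the helper name are hypothetical and do not mirror how vLLM stores this information.

```python
# Rows copied from the updated table; keys are the display names used there.
CUDAGRAPH_SUPPORT = {
    "FlashAttention v2": "UNIFORM_BATCH",
    "FlashAttention v3": "ALWAYS",
    "Triton Attention": "ALWAYS",
    "AITER FlashAttention": "UNIFORM_BATCH",
    "FlashInfer": "UNIFORM_SINGLE_TOKEN_DECODE",
    "FlashMLA": "UNIFORM_BATCH",
    "FlashInferMLA": "UNIFORM_BATCH",
    "FlashInferMLASparse": "UNIFORM_BATCH",
    "AITER MLA": "UNIFORM_SINGLE_TOKEN_DECODE",
    "CUTLASS MLA": "UNIFORM_SINGLE_TOKEN_DECODE",
    "Mamba attention": "UNIFORM_SINGLE_TOKEN_DECODE",
}


def cudagraph_support(backend: str) -> str:
    """Return the table entry, or NEVER for any backend the table does not list."""
    return CUDAGRAPH_SUPPORT.get(backend, "NEVER")


print(cudagraph_support("FlashInferMLASparse"))
print(cudagraph_support("Some Other Backend"))
```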
......
...@@ -54,6 +54,8 @@ For example: ...@@ -54,6 +54,8 @@ For example:
--8<-- "vllm/model_executor/layers/attention/mm_encoder_attention.py:mm_encoder_attn" --8<-- "vllm/model_executor/layers/attention/mm_encoder_attention.py:mm_encoder_attn"
--8<-- "vllm/model_executor/layers/mla.py:multi_head_latent_attention" --8<-- "vllm/model_executor/layers/mla.py:multi_head_latent_attention"
--8<-- "vllm/model_executor/models/deepencoder.py:rel_pos_attention"
``` ```
**2. Activation:** **2. Activation:**
......
...@@ -81,7 +81,7 @@ The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` cal ...@@ -81,7 +81,7 @@ The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` cal
The `make_ubatch_context` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts` and a CPU thread barrier. This function should be used exclusively to instantiate `UBatchContexts`. It will handle all of the event initialization. The `make_ubatch_context` function initializes two `UBatchContexts`, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContexts` and a CPU thread barrier. This function should be used exclusively to instantiate `UBatchContexts`. It will handle all of the event initialization.
The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to wait on an all-to-all kernel. The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalizeModular` class in the other UBatch thread’s `UBatchContext`. The callback will be run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to wait on an all-to-all kernel.
The `dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists. The `dbo_maybe_run_recv_hook` method runs a callback that’s set by the `dbo_register_recv_hook` function if that callback exists.
......
...@@ -5,12 +5,12 @@ TL;DR: ...@@ -5,12 +5,12 @@ TL;DR:
- use tlparse to acquire torch.compile logs. Include these logs in bug reports and/or support asks. - use tlparse to acquire torch.compile logs. Include these logs in bug reports and/or support asks.
- The vLLM-torch.compile integration is multiple pieces. vLLM exposes flags to turn off each piece: - The vLLM-torch.compile integration is multiple pieces. vLLM exposes flags to turn off each piece:
| Online Flag | Offline Flag | Result | | Online Flag | Offline Flag | Result |
|----------|----------|-------------| | ----------- | ------------ | ------ |
| --enforce-eager | enforce_eager=True | Turn off torch.compile and CUDAGraphs | | --enforce-eager | enforce_eager=True | Turn off torch.compile and CUDAGraphs |
| -cc.mode=0 | mode=CompilationMode.NONE | Turn off torch.compile only | | -cc.mode=0 | mode=CompilationMode.NONE | Turn off torch.compile only |
| -cc.cudagraph_mode=NONE | compilation_config=CompilationConfig(cudagraph_mode=CUDAGraphMode.NONE) | Turn off CUDAGraphs only | | -cc.cudagraph_mode=NONE | compilation_config=CompilationConfig(cudagraph_mode=CUDAGraphMode.NONE) | Turn off CUDAGraphs only |
| -cc.backend=eager | compilation_config=CompilationConfig(backend='eager') | Turn off TorchInductor | | -cc.backend=eager | compilation_config=CompilationConfig(backend='eager') | Turn off TorchInductor |
## vLLM-torch.compile overview ## vLLM-torch.compile overview
......
...@@ -15,7 +15,7 @@ Based on the format of the input activations, FusedMoE implementations are broad ...@@ -15,7 +15,7 @@ Based on the format of the input activations, FusedMoE implementations are broad
The input activation format completely depends on the All2All Dispatch being used. The input activation format completely depends on the All2All Dispatch being used.
* In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K) along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example. * In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K) along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example.
* In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `PplxPrepareAndFinalize` or `DeepEPLLPrepareAndFinalize` for an example. * In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `DeepEPLLPrepareAndFinalize` for an example.
The FusedMoE operation is generally made of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below The FusedMoE operation is generally made of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below
...@@ -37,31 +37,31 @@ The rest of the document will focus on the Contiguous / Non-Batched case. Extrap ...@@ -37,31 +37,31 @@ The rest of the document will focus on the Contiguous / Non-Batched case. Extrap
FusedMoEModularKernel splits the FusedMoE operation into 3 parts, FusedMoEModularKernel splits the FusedMoE operation into 3 parts,
1. TopKWeightAndReduce 1. TopKWeightAndReduce
2. FusedMoEPrepareAndFinalize 2. FusedMoEPrepareAndFinalizeModular
3. FusedMoEPermuteExpertsUnpermute 3. FusedMoEExpertsModular
### TopKWeightAndReduce ### TopKWeightAndReduce
The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`. But some implementations choose to do it `FusedMoEPrepareAndFinalize`. In order to enable this flexibility, we have a TopKWeightAndReduce abstract class. The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEExpertsModular` is responsible for the Unpermute and `FusedMoEPrepareAndFinalizeModular` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEExpertsModular`. But some implementations choose to do it `FusedMoEPrepareAndFinalizeModular`. In order to enable this flexibility, we have a TopKWeightAndReduce abstract class.
Please find the implementations of TopKWeightAndReduce [here](../../vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py). Please find the implementations of TopKWeightAndReduce [here](../../vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
`FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method. `FusedMoEPrepareAndFinalizeModular::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPerpareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens. The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEExpertsModular` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceNoOp` if the `FusedMoEPermuteExpertsUnpermute` implementation does the weight application and reduction itself. * `FusedMoEExpertsModular::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceNoOp` if the `FusedMoEExpertsModular` implementation does the weight application and reduction itself.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs the `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction. * `FusedMoEExpertsModular::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEExpertsModular` implementation needs the `FusedMoEPrepareAndFinalizeModular::finalize()` to do the weight application and reduction.
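The delegation described in the two bullets can be illustrated with toy classes. Everything prefixed `Toy`/`toy_` is hypothetical, nested lists stand in for tensors, and the All2All Combine is reduced to a comment; the sketch only shows how the object returned by the experts decides whether `finalize` does the weighting and reduction.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyWeightAndReduce(ABC):
    @abstractmethod
    def apply(self, expert_out: list, topk_weights: List[List[float]]) -> list: ...


class ToyNoOp(ToyWeightAndReduce):
    """Experts already applied weights and reduced; expert_out is (M, K)."""

    def apply(self, expert_out, topk_weights):
        return expert_out


class ToyContiguous(ToyWeightAndReduce):
    """expert_out is (M, topk, K); weight each expert output and sum over topk."""

    def apply(self, expert_out, topk_weights):
        reduced = []
        for per_expert, weights in zip(expert_out, topk_weights):
            k = len(per_expert[0])
            reduced.append(
                [sum(w * vec[i] for vec, w in zip(per_expert, weights)) for i in range(k)]
            )
        return reduced


def toy_finalize(expert_out, topk_weights, weight_and_reduce: ToyWeightAndReduce):
    out = weight_and_reduce.apply(expert_out, topk_weights)
    # A real finalize() would run the All2All Combine at this point.
    return out


# One token, two experts, K = 2: finalize does the weighting and reduction.
print(toy_finalize([[[1.0, 2.0], [3.0, 4.0]]], [[0.5, 0.5]], ToyContiguous()))
# Experts already reduced: finalize passes the result through unchanged.
print(toy_finalize([[2.0, 3.0]], [[0.5, 0.5]], ToyNoOp()))
```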
### FusedMoEPrepareAndFinalize ### FusedMoEPrepareAndFinalizeModular
The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare`, `prepare_no_receive` and `finalize` functions. The `FusedMoEPrepareAndFinalizeModular` abstract class exposes `prepare`, `prepare_no_receive` and `finalize` functions.
The `prepare` function is responsible for input activation Quantization and All2All Dispatch. If implemented, The `prepare_no_receive` is like `prepare` except it does not wait to receive results from other workers. Instead it returns a "receiver" callback that must be invoked to wait for the final results of worker. It is not required that this method is supported by all `FusedMoEPrepareAndFinalize` classes, but if it is available, it can be used to interleave work with the initial all to all communication, e.g. interleaving shared experts with fused experts. The `finalize` function is responsible for invoking the All2All Combine. Additionally the `finalize` function may or may not do the TopK weight application and reduction (Please refer to the TopKWeightAndReduce section) The `prepare` function is responsible for input activation Quantization and All2All Dispatch. If implemented, The `prepare_no_receive` is like `prepare` except it does not wait to receive results from other workers. Instead it returns a "receiver" callback that must be invoked to wait for the final results of worker. It is not required that this method is supported by all `FusedMoEPrepareAndFinalizeModular` classes, but if it is available, it can be used to interleave work with the initial all to all communication, e.g. interleaving shared experts with fused experts. The `finalize` function is responsible for invoking the All2All Combine. Additionally the `finalize` function may or may not do the TopK weight application and reduction (Please refer to the TopKWeightAndReduce section)
![FusedMoEPrepareAndFinalize Blocks](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png) ![FusedMoEPrepareAndFinalizeModular Blocks](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png)
### FusedMoEPermuteExpertsUnpermute ### FusedMoEExpertsModular
The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operations happen. The `FusedMoEPermuteExpertsUnpermute` abstract class exposes a few important functions, The `FusedMoEExpertsModular` class is where the crux of the MoE operations happen. The `FusedMoEExpertsModular` abstract class exposes a few important functions,
* apply() * apply()
* workspace_shapes() * workspace_shapes()
...@@ -81,25 +81,25 @@ The `apply` method is where the implementations perform ...@@ -81,25 +81,25 @@ The `apply` method is where the implementations perform
#### workspace_shapes() #### workspace_shapes()
The core FusedMoE implementation performs a series of operations. It would be inefficient to create output memory for each of these operations separately. To that effect, implementations are required to declare 2 workspace shapes, the workspace datatype and the FusedMoE output shape as outputs of the workspace_shapes() method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces could then be used as intermediate buffers in the FusedMoE implementation. The core FusedMoE implementation performs a series of operations. It would be inefficient to create output memory for each of these operations separately. To that effect, implementations are required to declare 2 workspace shapes, the workspace datatype and the FusedMoE output shape as outputs of the workspace_shapes() method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and passed on to the `FusedMoEExpertsModular::apply()` method. The workspaces could then be used as intermediate buffers in the FusedMoE implementation.
#### finalize_weight_and_reduce_impl() #### finalize_weight_and_reduce_impl()
It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section. It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEExpertsModular::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use. `FusedMoEExpertsModular::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalizeModular::finalize()` to use.
![FusedMoEPermuteExpertsUnpermute Blocks](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png) ![FusedMoEExpertsModular Blocks](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png)
### FusedMoEModularKernel ### FusedMoEModularKernel
`FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects. `FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalizeModular` and `FusedMoEExpertsModular` objects.
`FusedMoEModularKernel` pseudocode/sketch, `FusedMoEModularKernel` pseudocode/sketch,
```py ```py
class FusedMoEModularKernel: class FusedMoEModularKernel:
def __init__(self, def __init__(self,
prepare_finalize: FusedMoEPrepareAndFinalize, prepare_finalize: FusedMoEPrepareAndFinalizeModular,
fused_experts: FusedMoEPermuteExpertsUnpermute): fused_experts: FusedMoEExpertsModular):
self.prepare_finalize = prepare_finalize self.prepare_finalize = prepare_finalize
self.fused_experts = fused_experts self.fused_experts = fused_experts
...@@ -128,54 +128,50 @@ class FusedMoEModularKernel: ...@@ -128,54 +128,50 @@ class FusedMoEModularKernel:
## How-To ## How-To
### How To Add a FusedMoEPrepareAndFinalize Type ### How To Add a FusedMoEPrepareAndFinalizeModular Type
Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & Combine implementation / kernel. For example, Typically a FusedMoEPrepareAndFinalizeModular type is backed by an All2All Dispatch & Combine implementation / kernel. For example,
* PplxPrepareAndFinalize type is backed by Pplx All2All kernels,
* DeepEPHTPrepareAndFinalize type is backed by DeepEP High-Throughput All2All kernels, and * DeepEPHTPrepareAndFinalize type is backed by DeepEP High-Throughput All2All kernels, and
* DeepEPLLPrepareAndFinalize type is backed by DeepEP Low-Latency All2All kernels. * DeepEPLLPrepareAndFinalize type is backed by DeepEP Low-Latency All2All kernels.
#### Step 1: Add an All2All manager #### Step 1: Add an All2All manager
The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](../../vllm/distributed/device_communicators/all2all.py). The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalizeModular` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](../../vllm/distributed/device_communicators/all2all.py).
#### Step 2: Add a FusedMoEPrepareAndFinalize Type #### Step 2: Add a FusedMoEPrepareAndFinalizeModular Type
This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class. This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalizeModular` abstract class.
`FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked. `FusedMoEPrepareAndFinalizeModular::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked.
`FusedMoEPrepareAndFinalize::has_prepare_no_receive()`: Indicates whether or not this subclass implements `prepare_no_receive`. Defaults to False. `FusedMoEPrepareAndFinalizeModular::has_prepare_no_receive()`: Indicates whether or not this subclass implements `prepare_no_receive`. Defaults to False.
`FusedMoEPrepareAndFinalize::prepare_no_receive()`: The prepare_no_receive method implements the Quantization and All2All Dispatch. It does not wait for the result of the dispatch operation but instead returns a thunk that can be invoked to wait for the final results. Typically the Dispatch function from the relevant All2All Manager is invoked. `FusedMoEPrepareAndFinalizeModular::prepare_no_receive()`: The prepare_no_receive method implements the Quantization and All2All Dispatch. It does not wait for the result of the dispatch operation but instead returns a thunk that can be invoked to wait for the final results. Typically the Dispatch function from the relevant All2All Manager is invoked.
`FusedMoEPrepareAndFinalize::finalize()`: Maybe perform TopK Weight Application and Reduction and All2All Combine. Typically the Combine function from the relevant All2AllManager is invoked. `FusedMoEPrepareAndFinalizeModular::finalize()`: Maybe perform TopK Weight Application and Reduction and All2All Combine. Typically the Combine function from the relevant All2AllManager is invoked.
`FusedMoEPrepareAndFinalize::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise. `FusedMoEPrepareAndFinalizeModular::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise.
`FusedMoEPrepareAndFinalize::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoe::select_experts` function so it could be respected. If there are no strict requirements return None. `FusedMoEPrepareAndFinalizeModular::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoe::select_experts` function so it could be respected. If there are no strict requirements return None.
`FusedMoEPrepareAndFinalize::max_num_tokens_per_rank()`: This is the maximum number of tokens that would be submitted to the All2All Dispatch at once. `FusedMoEPrepareAndFinalizeModular::max_num_tokens_per_rank()`: This is the maximum number of tokens that would be submitted to the All2All Dispatch at once.
`FusedMoEPrepareAndFinalize::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K). Here max_num_tokens = num_dispatchers() * max_num_tokens_per_rank(). `FusedMoEPrepareAndFinalizeModular::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K). Here max_num_tokens = num_dispatchers() * max_num_tokens_per_rank().
We suggest picking an already existing `FusedMoEPrepareAndFinalize` implementation that matches your All2All implementation closely and using it as a reference. We suggest picking an already existing `FusedMoEPrepareAndFinalizeModular` implementation that matches your All2All implementation closely and using it as a reference.
### How To Add a FusedMoEPermuteExpertsUnpermute Type ### How To Add a FusedMoEExpertsModular Type
FusedMoEPermuteExpertsUnpermute performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance is as follows, FusedMoEExpertsModular performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance is as follows,
`FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats. i.e. Contiguous / Batched format. `FusedMoEExpertsModular::activation_formats()`: Return the supported Input and Output activation formats. i.e. Contiguous / Batched format.
`FusedMoEPermuteExpertsUnpermute::supports_chunking()`: Return True if the implementation supports chunking. Typically `FusedMoEExpertsModular::supports_expert_map()`: Return True if the implementation supports expert map.
implementations that input `FusedMoEActivationFormat.Standard` support chunking and `FusedMoEActivationFormat.BatchedExperts` do not.
`FusedMoEPermuteExpertsUnpermute::supports_expert_map()`: Return True if the implementation supports expert map. `FusedMoEExpertsModular::workspace_shapes()` /
`FusedMoEExpertsModular::finalize_weight_and_reduce_impl` /
`FusedMoEPermuteExpertsUnpermute::workspace_shapes()` / `FusedMoEExpertsModular::apply`: Refer to `FusedMoEExpertsModular` section above.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` /
`FusedMoEPermuteExpertsUnpermute::apply`: Refer to `FusedMoEPermuteExpertsUnpermute` section above.
### FusedMoEModularKernel Initialization ### FusedMoEModularKernel Initialization
...@@ -187,14 +183,14 @@ implementations that input `FusedMoEActivationFormat.Standard` support chunking ...@@ -187,14 +183,14 @@ implementations that input `FusedMoEActivationFormat.Standard` support chunking
#### maybe_make_prepare_finalize #### maybe_make_prepare_finalize
The `maybe_make_prepare_finalize` method is responsible for constructing an instance of `FusedMoEPrepareAndFinalize` when appropriate based on the current all2all backend, e.g. when EP + DP is enabled. The base class method currently constructs all the `FusedMoEPrepareAndFinalize` objects for the EP+DP case. Derived classes can override this method to construct prepare/finalize objects for different scenarios, e.g. `ModelOptNvFp4FusedMoE` can construct a `FlashInferCutlassMoEPrepareAndFinalize` for the EP+TP case. The `maybe_make_prepare_finalize` method is responsible for constructing an instance of `FusedMoEPrepareAndFinalizeModular` when appropriate based on the current all2all backend, e.g. when EP + DP is enabled. The base class method currently constructs all the `FusedMoEPrepareAndFinalizeModular` objects for the EP+DP case. Derived classes can override this method to construct prepare/finalize objects for different scenarios, e.g. `ModelOptNvFp4FusedMoE` can construct a `FlashInferCutlassMoEPrepareAndFinalize` for the EP+TP case.
Please refer to the implementations in, Please refer to the implementations in,
* `ModelOptNvFp4FusedMoE` * `ModelOptNvFp4FusedMoE`
#### select_gemm_impl #### select_gemm_impl
The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object. The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEExpertsModular` object.
Please refer to the implementations in, Please refer to the implementations in,
* `UnquantizedFusedMoEMethod` * `UnquantizedFusedMoEMethod`
...@@ -206,7 +202,7 @@ derived classes. ...@@ -206,7 +202,7 @@ derived classes.
#### init_prepare_finalize #### init_prepare_finalize
Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalizeModular` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEExpertsModular` object and builds the `FusedMoEModularKernel` object
Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188). Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188).
**Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to what FusedMoE implementation is used. **Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to what FusedMoE implementation is used.
...@@ -215,35 +211,35 @@ Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vl ...@@ -215,35 +211,35 @@ Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vl
We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py). We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py).
The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPremuteExpertsUnpermute` types and if they are The unit test iterates through all combinations of `FusedMoEPrepareAndFinalizeModular` and `FusedMoEPremuteExpertsUnpermute` types and if they are
compatible, runs some correctness tests. compatible, runs some correctness tests.
If you are adding some `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementations, If you are adding some `FusedMoEPrepareAndFinalizeModular` / `FusedMoEExpertsModular` implementations,
1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](../../tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively. 1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](../../tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
2. Update `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`, 2. Update `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`, `Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`
`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](../../tests/kernels/moe/modular_kernel_tools/common.py) methods in [/tests/kernels/moe/modular_kernel_tools/common.py](../../tests/kernels/moe/modular_kernel_tools/common.py)
Doing this will add the new implementation to the test suite. Doing this will add the new implementation to the test suite.
### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility ### How To Check `FusedMoEPrepareAndFinalizeModular` & `FusedMoEExpertsModular` Compatibility
The unit test file [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script. The unit test file [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts` Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type DeepEPLLPrepareAndFinalize --experts-type BatchedTritonExperts`
As a side effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked As a side effect, this script can be used to test `FusedMoEPrepareAndFinalizeModular` & `FusedMoEExpertsModular` compatibility. When invoked
with incompatible types, the script will error. with incompatible types, the script will error.
### How To Profile ### How To Profile
Please take a look at [profile_modular_kernel.py](../../tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py) Please take a look at [profile_modular_kernel.py](../../tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py)
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types. `FusedMoEPrepareAndFinalizeModular` and `FusedMoEExpertsModular` types.
Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts` Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type DeepEPLLPrepareAndFinalize --experts-type BatchedTritonExperts`
## FusedMoEPrepareAndFinalize Implementations ## FusedMoEPrepareAndFinalizeModular Implementations
See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-modular-all2all-backends) for a list of all the available modular prepare and finalize subclasses. See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-modular-all2all-backends) for a list of all the available modular prepare and finalize subclasses.
## FusedMoEPermuteExpertsUnpermute ## FusedMoEExpertsModular
See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-experts-kernels) for a list of all the available modular experts. See [Fused MoE Kernel features](./moe_kernel_features.md#fused-moe-experts-kernels) for a list of all the available modular experts.
# Fusion torch.compile passes
vLLM applies a set of kernel/operator fusions at compile time (via custom [`torch.compile`](torch_compile.md) Inductor passes)
to separate optimizations from model definitions and avoid breaking layer abstractions in model code.
These fusions are controlled by fields in [`PassConfig`][vllm.config.compilation.PassConfig] and are automatically enabled
at appropriate [optimization levels](optimization_levels.md).
## Quick Reference
The table below maps each fusion to its controlling flag/config knob, the
operations it fuses, the optimization level that enables it by default, and an indicative speedup.
The Fullgraph column indicates whether the fusion requires the entire model graph to be
visible (either via Inductor partition or `splitting_ops=[]`),
and the last column indicates whether the fusion activates for all `num_tokens`
or just on the low or high end.
!!! info
    Speedup depends heavily on the exact model, batch size, and hardware.
    If tuning performance by hand, always benchmark your exact use-case with and without the fusion to verify the impact.
| Fusion | `PassConfig` flag | Fused operations | Default at | E2E Speedup | Fullgraph | `num_tokens` |
| ------------------------------------------------------------------------------ | ---------------------------- | ---------------------------------------------- | ------------------------------ | ------------------ | --------- | ------------ |
| [AllReduce + RMSNorm](#allreduce--rmsnorm-fuse_allreduce_rms) | `fuse_allreduce_rms` | All-reduce → RMSNorm (+residual_add) (→ quant) | O2 (Hopper/Blackwell + TP > 1) | 5-20% | No | Low |
| [Attention + Quant](#attention--quantization-fuse_attn_quant) | `fuse_attn_quant` | Attention output → FP8/NVFP4 quant | Off by default | 3-7% | Yes | Always |
| [RoPE + KV-Cache Update](#rope--kv-cache-update-fuse_rope_kvcache) | `fuse_rope_kvcache` | Rotary embedding → KV cache write | O1 (ROCm/AITER only) | TBD | No | Low |
| [QK Norm + RoPE](#qk-norm--rope-enable_qk_norm_rope_fusion) | `enable_qk_norm_rope_fusion` | Q/K RMSNorm → rotary embedding | Off by default | 2-3% | No | Low |
| [Sequence Parallelism](#sequence-parallelism-enable_sp) | `enable_sp` | AllReduce → ReduceScatter + AllGather | Off by default | Prereq for AsyncTP | Yes | High |
| [AsyncTP GEMM + collective](#asynctp-gemm--collective-overlap-fuse_gemm_comms) | `fuse_gemm_comms` | GEMM → reduce-scatter / all-gather → GEMM | Off by default | 7-10% | Yes | High |
| [RMSNorm + Quant](#rmsnorm--quantization-fuse_norm_quant) | `fuse_norm_quant` | RMSNorm (+residual add) → FP8/FP4 quant | O1 (conditional) | 1-4% | No | Always |
| [SiLU+Mul + Quant](#silumul--quantization-fuse_act_quant) | `fuse_act_quant` | SiLU+Mul activation → FP8/FP4 quant | O1 (conditional) | 1-4% | No | Always |
| [RMSNorm + Padding](#rmsnorm--padding-fuse_act_padding) | `fuse_act_padding` | Residual add + RMSNorm → padding | O1 (ROCm/AITER only) | TBD | No | Always |
## Support Matrix
The table below lists the quantization schemes supported by each fusion on each platform.
**—** means the fusion is not available on that platform. The latest and in-progress work is available in the tracking issue:
[#36066](https://github.com/vllm-project/vllm/issues/36066)
| Fusion | SM100 (Blackwell) | SM90 (Hopper) | SM89 (Ada) | SM80 (Ampere) | ROCm |
| ---------------------------- | ---------------------------------------- | ---------------------------------------- | ---------------------------------------- | ------------- | ---------------------------------------- |
| `fuse_allreduce_rms` | FP16/BF16, FP8 static, NVFP4 | FP16/BF16, FP8 static | — | — | — |
| `fuse_attn_quant`\* | FP8 static\*, NVFP4\* | FP8 static\* | FP8 static\* | — | FP8 static\* |
| `fuse_rope_kvcache` | — | — | — | — | FP16/BF16 |
| `enable_qk_norm_rope_fusion` | FP16/BF16 | FP16/BF16 | FP16/BF16† | FP16/BF16† | — |
| `enable_sp` | FP16/BF16, FP8 static† | FP16/BF16, FP8 static | FP16/BF16† | FP16/BF16† | — |
| `fuse_gemm_comms` | FP16/BF16, FP8 static† | FP16/BF16, FP8 static | FP16/BF16† | FP16/BF16† | — |
| `fuse_norm_quant` | FP8 static, FP8 per-token, FP8 per-group | FP8 static, FP8 per-token, FP8 per-group | FP8 static, FP8 per-token, FP8 per-group | — | FP8 static, FP8 per-token, FP8 per-group |
| `fuse_act_quant` | FP8 static, NVFP4 | FP8 static | FP8 static | — | FP8 per-group |
| `fuse_act_padding` | — | — | — | — | FP16/BF16 |
\* `fuse_attn_quant` support depends on the attention backend in use; not all backends support
fused quantization output. See the [`fuse_attn_quant` section](#attention--quantization-fuse_attn_quant)
for per-backend details.
† `enable_sp` and `fuse_gemm_comms` are only autoconfigured for SM90 today;
support on other architectures requires setting `PassConfig.sp_min_token_num` explicitly.
SM100 support also requires setting `VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel`.
## Enabling / Disabling Fusions
Fusions are exposed through `PassConfig`, which is nested inside `CompilationConfig`:
```python
from vllm import LLM
from vllm.config import CompilationConfig, PassConfig

llm = LLM(
    model="...",
    optimization_level=2,  # Default optimization level
    compilation_config=CompilationConfig(
        pass_config=PassConfig(
            fuse_norm_quant=True,
            fuse_act_quant=True,
            fuse_allreduce_rms=False,  # disable a specific fusion
        )
    ),
)
```
Fusions can also be enabled using command-line flags with any `vllm ...` command:
```bash
# Enable O2 defaults, but turn off allreduce fusion
vllm serve meta-llama/Llama-3.1-8B-Instruct -O2 -cc.pass_config.fuse_allreduce_rms=False
# The above is equivalent to the more verbose:
vllm serve meta-llama/Llama-3.1-8B-Instruct -O2 --compilation-config '{"pass_config": {"fuse_allreduce_rms": false}}'
# Same syntax in other commands, e.g. vllm bench:
vllm bench latency --model=meta-llama/Llama-3.1-8B-Instruct -O2 -cc.pass_config.fuse_allreduce_rms=False
```
Fields set explicitly by the user always take precedence over optimization-level defaults.
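
As a rough illustration of the two points above (the dotted `-cc.` shorthand is equivalent to the nested JSON form, and explicit user fields win over optimization-level defaults), here is a minimal sketch in plain Python. The helpers `expand_dotted` and `overlay` and the `level_defaults` values are hypothetical and exist only for this example; this is not vLLM's actual argument parser or default table.

```python
import json


def expand_dotted(key: str, value):
    """Turn 'a.b' plus a value into the nested dict {'a': {'b': value}}."""
    nested = value
    for part in reversed(key.split(".")):
        nested = {part: nested}
    return nested


def overlay(defaults: dict, overrides: dict) -> dict:
    """Recursively apply user overrides on top of defaults (hypothetical helper)."""
    merged = dict(defaults)
    for name, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(name), dict):
            merged[name] = overlay(merged[name], value)
        else:
            merged[name] = value
    return merged


# Illustrative defaults only: real per-level defaults depend on hardware and model.
level_defaults = {"pass_config": {"fuse_allreduce_rms": True, "fuse_norm_quant": True}}

# Dotted form of the override, expanded into the nested form used by the JSON flag.
user = expand_dotted("pass_config.fuse_allreduce_rms", False)
user_json = json.dumps(user)

# The explicit user field replaces the default; untouched fields keep their defaults.
effective = overlay(level_defaults, user)
print(user_json)
print(effective)
```

Only the field the user named changes; every other field keeps whatever the optimization level chose.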
## Fusion Details
### AllReduce + RMSNorm (`fuse_allreduce_rms`)
!!! warning
    TP+DP and TP+PP combinations are currently broken
    ([#34458](https://github.com/vllm-project/vllm/issues/34458) and
    [#35426](https://github.com/vllm-project/vllm/issues/35426)).
    Only supported on NVIDIA Hopper (SM90) and Blackwell (SM100) with FlashInfer installed.
**What it fuses.** Fuses the tensor-parallel all-reduce collective with the subsequent residual add,
RMSNorm, and optionally a quantization step into a single FlashInfer / TRT-LLM communication kernel.
This fusion is only profitable for small `num_tokens`,
so the fusion is only performed in the lower compiled range.
Patterns covered:
- `AllReduce → RMSNorm(+residual_add)`: CUDA sm90+ with FlashInfer
- `AllReduce → RMSNorm(+residual_add) → FP8 static quant`: CUDA sm90+ with FlashInfer
- `AllReduce → RMSNorm(+residual_add) → NVFP4 dynamic quant`: CUDA sm100+ with FlashInfer
The maximum tensor size below which the fused kernel is used is hardware-dependent (64 MB for TP=2
on SM90/SM100) and configurable via `PassConfig.fi_allreduce_fusion_max_size_mb`.
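
To get a feel for what such a size limit means in tokens, the following back-of-the-envelope sketch estimates the activation size for a given `num_tokens`. It assumes the limit applies to a `num_tokens x hidden_size` activation tensor, that 1 MB means 2^20 bytes, and that the boundary is inclusive; all three are assumptions of this example, and `fits_fused_allreduce` is a hypothetical helper, not a vLLM function.

```python
def fits_fused_allreduce(
    num_tokens: int,
    hidden_size: int,
    bytes_per_element: int,
    max_size_mb: float,
) -> bool:
    """Return True if the activation tensor is within the size limit (assumed inclusive)."""
    tensor_bytes = num_tokens * hidden_size * bytes_per_element
    return tensor_bytes <= max_size_mb * 1024 * 1024


# BF16 activations (2 bytes per element) with hidden_size=8192 and a 64 MB limit.
small_batch = fits_fused_allreduce(256, 8192, 2, 64)   # 4 MiB tensor
large_batch = fits_fused_allreduce(8192, 8192, 2, 64)  # 128 MiB tensor
print(small_batch, large_batch)
```

Under these assumptions the small batch qualifies for the fused kernel and the large one does not, which matches the statement that the fusion targets small `num_tokens`.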
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/allreduce_rms_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/allreduce_rms_fusion.py)
- FlashInfer all-reduce: [`vllm/distributed/device_communicators/flashinfer_all_reduce.py`](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/device_communicators/flashinfer_all_reduce.py)
- Benchmark: [`benchmarks/kernels/benchmark_fused_collective.py`](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_fused_collective.py)
### Attention + Quantization (`fuse_attn_quant`)
!!! info
    `fuse_attn_quant` is currently not enabled at any optimization level by default and must be set
    explicitly. It requires the full model graph to be visible (Inductor partition or `splitting_ops=[]`).
**What it fuses.** Fuses the attention output quantization directly after the attention computation,
eliminating a full-precision memory round-trip of the attention output. Patterns covered:
`Attention → FP8 static quant`:
- `TRITON_ATTN`: CUDA, ROCm
- `FLASHINFER`: CUDA sm100+ with FlashInfer installed
- `ROCM_ATTN`: ROCm
- `ROCM_AITER_UNIFIED_ATTN`: ROCm with AITER
`Attention → NVFP4 dynamic quant`:
- `FLASHINFER`: CUDA sm100+ with FlashInfer installed
Other attention backends do not support fused output quantization yet.
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/attn_quant_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/attn_quant_fusion.py)
- Attention backends: [`vllm/v1/attention/backends/`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/)
### RoPE + KV-Cache Update (`fuse_rope_kvcache`)
!!! info
    ROCm/AITER-only. Not available on NVIDIA CUDA or CPU. The fusion is only enabled for
    `num_tokens ≤ 256` by default due to AITER fused kernel performance issues.
    This threshold is configurable via `PassConfig.rope_kvcache_fusion_max_token_num`.
**What it fuses.** Fuses the rotary positional embedding kernel with the KV-cache scatter/write into
a single kernel, avoiding separate reads and writes of the key and value tensors.
Requires: AMD ROCm with AITER enabled, the `rotary_embedding` custom op active (automatic),
and the `kv_cache` update op visible in the graph, either by using Inductor graph partition
or by removing it from `splitting_ops`.
If these conditions are met, the fusion is enabled automatically for optimization level O1 and above.
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/rope_kvcache_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/rope_kvcache_fusion.py)
### Sequence Parallelism (`enable_sp`)
**What it fuses.** Replaces all-reduce collectives with reduce-scatter + local RMSNorm + all-gather,
splitting the sequence dimension across TP ranks. This restructures the graph so the subsequent AsyncTP
pass can fuse the reduce-scatter / all-gather with the surrounding GEMMs.
Sequence Parallelism itself does not directly improve performance; it is a prerequisite for the
AsyncTP pass (`fuse_gemm_comms`). SP is only applied above a minimum token threshold that is
autoconfigured based on device capability and model `hidden_size`. Currently only active on
H100/SM90 for models with `hidden_size >= 8192`. The threshold is configurable via
`PassConfig.sp_min_token_num`.
The general transformation:
```text
Input → AllReduce → RMSNorm → Output
becomes:
Input → ReduceScatter → local RMSNorm → AllGather → Output
```
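
The rewrite above is valid because RMSNorm acts on each token independently, so normalizing a shard of tokens on each rank and gathering the results gives the same output as normalizing the full all-reduced tensor. A minimal single-process sketch, assuming two simulated ranks, plain Python lists in place of tensors, and an RMSNorm without the learned weight (all helper names are hypothetical):

```python
import math


def rms_norm(row, eps=1e-6):
    """RMSNorm over one token's hidden vector (learned weight omitted)."""
    scale = 1.0 / math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [x * scale for x in row]


def all_reduce(partials):
    """Elementwise sum of every rank's partial [num_tokens][hidden] tensor."""
    num_tokens = len(partials[0])
    return [
        [sum(vals) for vals in zip(*(p[t] for p in partials))]
        for t in range(num_tokens)
    ]


def reduce_scatter(partials):
    """Sum across ranks, then hand each rank a contiguous shard of tokens."""
    world = len(partials)
    summed = all_reduce(partials)
    shard = len(summed) // world  # num_tokens must be divisible by world size
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(shards):
    """Concatenate every rank's token shard back into the full tensor."""
    return [row for shard in shards for row in shard]


# Partial outputs held by two simulated TP ranks: 4 tokens, hidden size 2.
partials = [
    [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [2.0, 1.0]],  # rank 0
    [[0.5, 1.0], [1.0, 1.0], [1.5, 2.5], [0.0, 3.0]],  # rank 1
]

# Original graph: AllReduce -> RMSNorm on the full tensor.
baseline = [rms_norm(row) for row in all_reduce(partials)]

# Rewritten graph: ReduceScatter -> local RMSNorm per rank -> AllGather.
shards = reduce_scatter(partials)
sp_out = all_gather([[rms_norm(row) for row in shard] for shard in shards])

print(sp_out == baseline)
```

The token count in this sketch is chosen to be divisible by the number of ranks so that every rank receives an equally sized shard.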
Patterns covered:
- First block: `AllReduce → RMSNorm` → `ReduceScatter → RMSNorm → AllGather`
- Middle blocks: `AllReduce → fused_add_RMSNorm` → `ReduceScatter → fused_add_RMSNorm → AllGather`
- Both with optional `→ FP8 static quant` suffix
Requires: `use_inductor_graph_partition=True` **or** piecewise compilation with static sizes
divisible by `tensor_parallel_size`.
Supported hardware: Only tested on NVIDIA CUDA, possibly works on ROCm. FP8 all-gather requires sm90+.
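
The rewrite can be checked numerically with a small simulation. The following is a minimal sketch in plain Python, not vLLM code: the collectives are modeled as list operations over per-rank tensors, and the helper names (`all_reduce`, `reduce_scatter`, `all_gather`, `rms_norm`) are illustrative. It relies on RMSNorm acting on each token independently, so normalizing a shard of tokens locally gives the same rows as normalizing all tokens on every rank.

```python
import math

def rms_norm(row, weight, eps=1e-6):
    """RMSNorm over the hidden dimension of a single token."""
    rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
    return [w * x / rms for x, w in zip(row, weight)]

def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' tensors."""
    num_tokens, hidden = len(per_rank[0]), len(per_rank[0][0])
    total = [[sum(r[t][h] for r in per_rank) for h in range(hidden)]
             for t in range(num_tokens)]
    return [total for _ in per_rank]

def reduce_scatter(per_rank):
    """Sum across ranks, then give rank i only its slice of the token dimension."""
    total = all_reduce(per_rank)[0]
    shard = len(total) // len(per_rank)
    return [total[i * shard:(i + 1) * shard] for i in range(len(per_rank))]

def all_gather(shards):
    """Concatenate every rank's token shard; all ranks receive the full result."""
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]

# Two TP ranks, four tokens, hidden size 3: partial GEMM outputs per rank.
partials = [
    [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [4.0, 0.0, -2.0], [1.5, 1.5, 1.5]],
    [[0.0, 1.0, -1.0], [2.0, 2.0, 2.0], [-3.0, 1.0, 0.5], [0.25, -0.5, 1.0]],
]
weight = [1.0, 0.5, 2.0]

# Original graph: AllReduce -> RMSNorm (every rank normalizes all tokens).
baseline = [[rms_norm(row, weight) for row in full] for full in all_reduce(partials)]

# Sequence-parallel graph: ReduceScatter -> local RMSNorm -> AllGather.
local = [[rms_norm(row, weight) for row in shard] for shard in reduce_scatter(partials)]
sequence_parallel = all_gather(local)
```

The token count (4) is divisible by the number of ranks (2), which mirrors the divisibility requirement stated above; the sketch does not handle uneven shards.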
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/sequence_parallelism.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/sequence_parallelism.py)
### AsyncTP GEMM + Collective Overlap (`fuse_gemm_comms`)
!!! info
    Requires `enable_sp=True` (enabled automatically). This pass is a no-op if Sequence Parallelism has not been applied.
**What it fuses.** After Sequence Parallelism transforms the graph, fuses GEMM kernels with the
surrounding reduce-scatter (output projection) and all-gather (input projection) using
`torch.ops.symm_mem` symmetric-memory primitives, overlapping communication and computation.
This overlap is only profitable for large `num_tokens`, so the fusion (and preceding SP)
is only performed in the higher compiled range above `PassConfig.sp_min_token_num`.
Patterns covered:
- `GEMM → reduce-scatter` → `fused_matmul_reduce_scatter`
- `all-gather → GEMM` → `all_gather_matmul`
- FP8 scaled variants of both patterns
Supported hardware: NVIDIA CUDA with symmetric-memory (`torch.distributed._symmetric_memory`) support.
On B200, pattern-matching the FP8 FlashInfer scaled MM is not supported, so that kernel must be disabled
([#27893](https://github.com/vllm-project/vllm/issues/27893)):
```shell
VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel ...
```
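
As a rough illustration of how these options might be combined, the sketch below assembles a launch command with the Python standard library. The field names `enable_sp`, `fuse_gemm_comms`, and `sp_min_token_num` come from this page; nesting them under a `pass_config` key of the JSON passed to `--compilation-config`, the threshold value, and the `<model>` placeholder are assumptions made for the example and should be checked against the installed vLLM version.

```python
import json
import shlex

# Field names come from this page; the nesting under "pass_config" and the
# value 8192 are illustrative assumptions, not recommended settings.
compilation_config = {
    "pass_config": {
        "enable_sp": True,
        "fuse_gemm_comms": True,
        "sp_min_token_num": 8192,
    }
}

command = " ".join([
    # Only relevant for FP8 models on B200, per the note above.
    "VLLM_DISABLED_KERNELS=FlashInferFP8ScaledMMLinearKernel",
    "vllm", "serve", "<model>",
    "--tensor-parallel-size", "2",
    "--compilation-config", shlex.quote(json.dumps(compilation_config)),
])
print(command)
```

Quoting the JSON with `shlex.quote` keeps it a single shell argument despite the spaces and double quotes it contains.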
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/collective_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/collective_fusion.py)
- Sequence parallelism pass: [`vllm/compilation/passes/fusion/sequence_parallelism.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/sequence_parallelism.py)
### QK Norm + RoPE (`enable_qk_norm_rope_fusion`)
!!! info
    Only applicable to models that apply per-head RMSNorm to Q and K before rotary positional
    embedding (e.g. Qwen). Not enabled by default at any optimization level due to perf issues on H100:
    [#34391](https://github.com/vllm-project/vllm/issues/34391)
**What it fuses.** Fuses the sequence: split QKV → reshape → Q/K RMSNorm → reshape → rotary
embedding into a single `fused_qk_norm_rope` CUDA kernel.
```text
# Unfused:
q, k, v = split(qkv)
q_norm = rms_norm(q.view(heads))
k_norm = rms_norm(k.view(kv_heads))
q_rope, k_rope = rotary_embedding(q_norm, k_norm, ...)
# Fused:
fused_qk_norm_rope(qkv, ...)
```
Supported hardware: CUDA (sm80+) only, tested only on sm90 and sm100.
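
For reference, the unfused sequence can be written out for a single token in plain Python. This is a sketch of the math only, not the CUDA kernel: it assumes the NeoX-style rotary layout (pairing element `i` with `i + head_dim/2`) and a frequency base of 10000, and the function names are illustrative.

```python
import math

def rms_norm(head, weight, eps=1e-6):
    """RMSNorm over one attention head."""
    rms = math.sqrt(sum(x * x for x in head) / len(head) + eps)
    return [w * x / rms for x, w in zip(head, weight)]

def rope_neox(head, position, base=10000.0):
    """Rotate pairs (i, i + d/2) by a position-dependent angle."""
    half = len(head) // 2
    out = list(head)
    for i in range(half):
        angle = position * base ** (-2.0 * i / len(head))
        cos, sin = math.cos(angle), math.sin(angle)
        x1, x2 = head[i], head[i + half]
        out[i] = x1 * cos - x2 * sin
        out[i + half] = x2 * cos + x1 * sin
    return out

def unfused_qk_norm_rope(qkv, num_heads, num_kv_heads, head_dim, q_w, k_w, position):
    """Split -> per-head RMSNorm on Q and K -> RoPE. V passes through unchanged."""
    q_size, kv_size = num_heads * head_dim, num_kv_heads * head_dim
    q, k, v = qkv[:q_size], qkv[q_size:q_size + kv_size], qkv[q_size + kv_size:]

    def per_head(flat, weight):
        heads = [flat[i:i + head_dim] for i in range(0, len(flat), head_dim)]
        return [rope_neox(rms_norm(h, weight), position) for h in heads]

    return per_head(q, q_w), per_head(k, k_w), v

# 2 query heads and 1 KV head with head_dim 4: 8 (Q) + 4 (K) + 4 (V) values.
qkv = [0.1 * i for i in range(1, 17)]
q, k, v = unfused_qk_norm_rope(qkv, 2, 1, 4, [1.0] * 4, [1.0] * 4, position=3)
```

Each step here reads and writes a full intermediate; the fused kernel performs the same per-head work in one pass over `qkv`.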
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/qk_norm_rope_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/qk_norm_rope_fusion.py)
- CUDA kernel: [`csrc/ops.h`](https://github.com/vllm-project/vllm/blob/main/csrc/ops.h) (`fused_qk_norm_rope`)
### RMSNorm + Quantization (`fuse_norm_quant`)
!!! warning
    On NVIDIA, Inductor actually generates a faster fused kernel than our custom CUDA kernel.
    Hence, this fusion is only enabled when either `rms_norm` or `quant_fp8` is using a custom kernel.
**What it fuses.** Combines the custom `rms_norm` / `fused_add_rms_norm`
operations with subsequent quantization into a single fused kernel,
eliminating an intermediate read/write of the full-precision activation tensor.
Two variants are fused:
- *Plain RMSNorm + quant*: `rms_norm(x) → quant_fp8(y)`
- *Fused-add RMSNorm + quant*: `fused_add_rms_norm(x, residual) → quant_fp8(y)` — also updates the residual in-place.
Note that AITER fusions are currently in a separate pass in `vllm.compilation.passes.fusion.rocm_aiter_fusion`.
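
The saving comes from not materializing the normalized activation between the two operations. The sketch below contrasts the two forms for one token in plain Python. It is a simplification under stated assumptions: static per-tensor quantization is modeled as divide-by-scale and clamp to the e4m3 range (maximum 448), and rounding to the FP8 grid is deliberately omitted; the function names are illustrative.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format

def rms_norm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def quant_static(y, scale):
    """Scale and clamp to the FP8 range. Rounding to the FP8 grid is omitted."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in y]

def unfused(x, weight, scale):
    y = rms_norm(x, weight)        # full-precision intermediate is written out
    return quant_static(y, scale)  # and read back by a second pass

def fused(x, weight, scale, eps=1e-6):
    """Normalize and quantize each element without storing the intermediate."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [
        max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, (w * v / rms) / scale))
        for v, w in zip(x, weight)
    ]

x = [0.5, -2.0, 3.0, 1.0]
weight = [1.0, 1.0, 0.5, 2.0]
scale = 0.01
```

Both functions produce the same values; only the number of passes over the activation differs.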
Supported quantization scheme/hardware combinations:
- FP8 static per-tensor: CUDA & HIP kernel
- FP8 dynamic per-token: CUDA & HIP kernel, AITER
- FP8 dynamic per-token-group (128/64): CUDA & HIP kernel, AITER
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/rms_quant_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/rms_quant_fusion.py)
- ROCm AITER pass: [`vllm/compilation/passes/fusion/rocm_aiter_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/rocm_aiter_fusion.py)
- CUDA/HIP kernels: [`csrc/layernorm_quant_kernels.cu`](https://github.com/vllm-project/vllm/blob/main/csrc/layernorm_quant_kernels.cu)
### SiLU+Mul + Quantization (`fuse_act_quant`)
!!! warning
    Same as `fuse_norm_quant`: on NVIDIA, Inductor generates a faster fused kernel than our custom ops.
    This fusion is only enabled when either `silu_and_mul` or `quant_fp8` is using a custom kernel,
    or for NVFP4-quantized models (where FP4 quant is always a custom op).
**What it fuses.** Fuses the `silu_and_mul` gate-up projection activation with subsequent quantization into a single kernel,
avoiding materialization of the full-precision post-activation tensor.
Note that AITER fusions are in a separate pass in `vllm.compilation.passes.fusion.rocm_aiter_fusion`.
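
As a small reference for what is being fused, the sketch below implements `silu_and_mul` for one token in plain Python and combines it with a simplified quantization step. It assumes the usual layout in which the gate and up halves are concatenated along the last dimension; the quantization is reduced to scale-and-clamp (no rounding to the FP8 grid), and `fused_silu_mul_quant` is an illustrative name rather than a vLLM function.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def silu_and_mul(x):
    """x holds [gate, up] concatenated; returns silu(gate) * up (half the width)."""
    d = len(x) // 2
    return [silu(g) * u for g, u in zip(x[:d], x[d:])]

def quant_static(y, scale, qmax=448.0):
    """Scale and clamp only; rounding to the FP8 grid is omitted."""
    return [max(-qmax, min(qmax, v / scale)) for v in y]

def fused_silu_mul_quant(x, scale, qmax=448.0):
    """One pass: the post-activation tensor is never stored in full precision."""
    d = len(x) // 2
    return [max(-qmax, min(qmax, silu(x[i]) * x[d + i] / scale)) for i in range(d)]

x = [1.0, -0.5, 2.0, 0.0, 3.0, 4.0, -1.0, 2.5]  # gate = first 4, up = last 4
scale = 0.05
reference = quant_static(silu_and_mul(x), scale)
single_pass = fused_silu_mul_quant(x, scale)
```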
Supported quantization scheme/hardware combinations:
- FP8 static per-tensor: CUDA & HIP kernel
- NVFP4 dynamic: CUDA sm100+ only with FlashInfer
- FP8 per-token-group (128): ROCm AITER only
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/act_quant_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/act_quant_fusion.py)
- ROCm AITER pass: [`vllm/compilation/passes/fusion/rocm_aiter_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/rocm_aiter_fusion.py)
- CUDA/HIP kernels: [`csrc/quantization/`](https://github.com/vllm-project/vllm/blob/main/csrc/quantization/)
### RMSNorm + Padding (`fuse_act_padding`)
!!! info
    ROCm/AITER-only. Targeted at GPT-OSS models.
**What it fuses.** Fuses a residual add + RMSNorm with a subsequent padding operation that pads
the hidden dimension to a multiple required by downstream AITER Triton GEMM kernels.
Requires: AMD ROCm with AITER RMSNorm enabled. Enabled by default in optimization level O1 and above
when the hidden size is 2880 and AITER Triton GEMMs are *not* enabled.
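
The combined operation can be sketched for one token in plain Python. This is illustrative only: the padding multiple of 256 and the use of zero padding are assumptions chosen for the example (the page does not state the multiple required by the downstream GEMM), and `add_rms_norm_pad` is a hypothetical name.

```python
import math

def add_rms_norm_pad(x, residual, weight, multiple, eps=1e-6):
    """Residual add + RMSNorm, then zero-pad the hidden dimension up to `multiple`."""
    new_residual = [a + b for a, b in zip(x, residual)]
    rms = math.sqrt(sum(v * v for v in new_residual) / len(new_residual) + eps)
    normed = [w * v / rms for v, w in zip(new_residual, weight)]
    padded_len = -(-len(normed) // multiple) * multiple  # round up to a multiple
    return normed + [0.0] * (padded_len - len(normed)), new_residual

hidden = 2880
out, res = add_rms_norm_pad(
    [1.0] * hidden, [0.5] * hidden, [1.0] * hidden, multiple=256  # 256 is assumed
)
```

Fusing the padding into the norm avoids a separate copy of the activation just to widen it.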
**Code locations.**
- Pass: [`vllm/compilation/passes/fusion/rocm_aiter_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/rocm_aiter_fusion.py) (`RocmAiterTritonAddRMSNormPadFusionPass`)
## See Also
- [Optimization Levels](optimization_levels.md) — high-level presets that set
fusion defaults.
- [torch.compile in vLLM](torch_compile.md) — how the Inductor pass pipeline
works.
- [Attention Backends](attention_backends.md) — attention-specific kernel
selection.
...@@ -13,9 +13,28 @@ IOProcessorInput = TypeVar("IOProcessorInput") ...@@ -13,9 +13,28 @@ IOProcessorInput = TypeVar("IOProcessorInput")
IOProcessorOutput = TypeVar("IOProcessorOutput") IOProcessorOutput = TypeVar("IOProcessorOutput")
class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]): class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]):
def __init__(self, vllm_config: VllmConfig): """Abstract interface for pre/post-processing of engine I/O."""
def __init__(self, vllm_config: VllmConfig, renderer: BaseRenderer):
super().__init__()
self.vllm_config = vllm_config self.vllm_config = vllm_config
def parse_data(self, data: object) -> IOProcessorInput:
raise NotImplementedError
def merge_sampling_params(
self,
params: SamplingParams | None = None,
) -> SamplingParams:
return params or SamplingParams()
def merge_pooling_params(
self,
params: PoolingParams | None = None,
) -> PoolingParams:
return params or PoolingParams(task="plugin")
@abstractmethod @abstractmethod
def pre_process( def pre_process(
self, self,
...@@ -55,29 +74,13 @@ class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]): ...@@ -55,29 +74,13 @@ class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]):
[(i, item) async for i, item in model_output], key=lambda output: output[0] [(i, item) async for i, item in model_output], key=lambda output: output[0]
) )
collected_output = [output[1] for output in sorted_output] collected_output = [output[1] for output in sorted_output]
return self.post_process(collected_output, request_id, **kwargs) return self.post_process(collected_output, request_id=request_id, **kwargs)
@abstractmethod
def parse_request(self, request: Any) -> IOProcessorInput:
raise NotImplementedError
def validate_or_generate_params(
self, params: SamplingParams | PoolingParams | None = None
) -> SamplingParams | PoolingParams:
return params or PoolingParams()
@abstractmethod
def output_to_response(
self, plugin_output: IOProcessorOutput
) -> IOProcessorResponse:
raise NotImplementedError
``` ```
The `parse_request` method is used for validating the user prompt and converting it into the input expected by the `pre_process`/`pre_process_async` methods. The `parse_data` method is used for validating the user data and converting it into the input expected by the `pre_process*` methods.
The `merge_sampling_params` and `merge_pooling_params` methods merge input `SamplingParams` or `PoolingParams` (if any) with the default one.
The `pre_process*` methods take the validated plugin input to generate vLLM's model prompts for regular inference. The `pre_process*` methods take the validated plugin input to generate vLLM's model prompts for regular inference.
The `post_process*` methods take `PoolingRequestOutput` objects as input and generate a custom plugin output. The `post_process*` methods take `PoolingRequestOutput` objects as input and generate a custom plugin output.
The `validate_or_generate_params` method is used for validating with the plugin any `SamplingParameters`/`PoolingParameters` received with the user request, or to generate new ones if none are specified. The function always returns the validated/generated parameters.
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/pooling/pooling/serving.py).
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/pooling/plugin/prithvi_geospatial_mae_online.py](../../examples/pooling/plugin/prithvi_geospatial_mae_online.py)) and offline ([examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py](../../examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py)) inference examples. An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/pooling/plugin/prithvi_geospatial_mae_online.py](../../examples/pooling/plugin/prithvi_geospatial_mae_online.py)) and offline ([examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py](../../examples/pooling/plugin/prithvi_geospatial_mae_io_processor.py)) inference examples.
......
...@@ -352,7 +352,7 @@ The `BatchUpdate` abstraction models the persistent batch as a list of requests, ...@@ -352,7 +352,7 @@ The `BatchUpdate` abstraction models the persistent batch as a list of requests,
(s, d, UNIDIRECTIONAL or SWAP) (s, d, UNIDIRECTIONAL or SWAP)
``` ```
* If the Move specifies `UNIDRECTIONAL`: * If the Move specifies `UNIDIRECTIONAL`:
* The request at index `s` is moved to index `d`; index `s` becomes an empty slot * The request at index `s` is moved to index `d`; index `s` becomes an empty slot
......
...@@ -507,10 +507,10 @@ longer relevant in v1: ...@@ -507,10 +507,10 @@ longer relevant in v1:
- `vllm:num_requests_swapped` - `vllm:num_requests_swapped`
- `vllm:cpu_cache_usage_perc` - `vllm:cpu_cache_usage_perc`
In this mode, when a request is preempted (e.g. to make room in KV In this mode, when a request was preempted (e.g. to make room in KV
cache to complete other requests), we swap kv cache blocks out to CPU cache to complete other requests), kv cache blocks were swapped out to
memory. This is also known as "KV cache offloading" and is configured CPU memory. The `--swap-space` flag has been removed as this feature
with `--swap-space` and `--preemption-mode`. is no longer used in V1.
Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
SequenceGroup encapsulated the idea of N Sequences which SequenceGroup encapsulated the idea of N Sequences which
...@@ -656,7 +656,7 @@ vLLM has support for OpenTelemetry tracing: ...@@ -656,7 +656,7 @@ vLLM has support for OpenTelemetry tracing:
- Added by <https://github.com/vllm-project/vllm/pull/4687> and reinstated by <https://github.com/vllm-project/vllm/pull/20372> - Added by <https://github.com/vllm-project/vllm/pull/4687> and reinstated by <https://github.com/vllm-project/vllm/pull/20372>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces` - Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/) - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../examples/online_serving/opentelemetry.md) - [User-facing docs](../../examples/online_serving/opentelemetry/README.md)
- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f) - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview) - [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
......
# Model Runner V2 Design Document
## Introduction
Since vLLM V1 was first implemented, we discovered several fundamental design mistakes and accumulated significant technical debt. Many features were bolted on that were not considered in the original design. We also gained valuable insights into sampling techniques (for example, Gumbel-max sampling), tools (for example, Triton), and CUDA features (for example, UVA). With this knowledge, we implemented Model Runner V2 (MRV2) from first principles to be cleaner, more efficient, and more modular.
In hindsight, many of V1's design choices were suboptimal. While MRV2 is not yet feature-complete, not rigorously tested, and still has open design decisions, we believe it is a substantial improvement over V1.
This document describes the design of MRV2.
## 1. Persistent Batch
One significant source of friction in V1 is its persistent batch implementation.
### Background
V1 introduced persistent batches to minimize CPU overhead during input preparation. When requests are scheduled for a step, the model runner must construct contiguous input tensors (for example, block tables and per-request temperature values) to feed into the model. Building these tensors from scratch each step is often very slow in Python, especially for large tensors like block tables.
The persistent batch optimization exploits the fact that request batches in consecutive steps are mostly identical. Only a few requests (if any) join or finish per step. By maintaining persistent state tensors and applying incremental diffs instead of reconstructing inputs from scratch, CPU overhead can be reduced significantly.
### Problems with V1's Approach
While efficient, V1's persistent batch design introduced unnecessary complexity due to coupling persistent state with input tensors. V1 uses persistent state tensors directly as model and sampler inputs, which imposes strict layout and ordering requirements. When requests join or finish, this often requires complex tensor-wide reordering rather than simple row insertion/removal.
V1 also had to maintain `CachedRequestState`, a redundant backup copy of request state, because rows in persistent tensors can be overwritten while requests are still active.
The result is complex bookkeeping that becomes more difficult under async scheduling.
![Persistent Batch in V1](../assets/design/model_runner_v2/persistent_batch_v1.png)
### MRV2's Solution
MRV2 decouples persistent state tensors from per-step input tensors. Given request ordering for the step (usually determined by the attention backend), MRV2 gathers input tensors from persistent state.
1. Pre-allocate a fixed-size tensor with `max_num_reqs` rows (1024 by default on most platforms).
2. Assign each request a permanent row for its active lifetime (until finish or preemption).
3. Treat preemption as completion. On resume, re-add request data as fresh state.
This removes the need for `CachedRequestState` and simplifies bookkeeping. Large state tensors are mostly stored in GPU memory, so the gather runs in parallel on the GPU with low overhead.
![Persistent Batch in MRV2](../assets/design/model_runner_v2/persistent_batch_mrv2.png)
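
The row-assignment and gather scheme can be sketched in plain Python. This is a toy model, not the MRV2 implementation: a list stands in for a persistent state tensor, and the class and method names (`PersistentState`, `add_request`, `remove_request`, `gather`) are hypothetical.

```python
class PersistentState:
    """Fixed-size per-request state; each request keeps one row while it is active."""

    def __init__(self, max_num_reqs):
        self.temperature = [0.0] * max_num_reqs  # stand-in for a persistent state tensor
        self.free_rows = list(range(max_num_reqs - 1, -1, -1))
        self.row_of = {}  # req_id -> row index

    def add_request(self, req_id, temperature):
        row = self.free_rows.pop()
        self.row_of[req_id] = row
        self.temperature[row] = temperature
        return row

    def remove_request(self, req_id):
        # Finish and preemption are handled the same way: release the row.
        self.free_rows.append(self.row_of.pop(req_id))

    def gather(self, ordered_req_ids):
        """Build the per-step input in whatever order the step requires."""
        idx = [self.row_of[r] for r in ordered_req_ids]
        return [self.temperature[i] for i in idx]

state = PersistentState(max_num_reqs=4)
state.add_request("a", 0.7)
state.add_request("b", 1.0)
state.add_request("c", 0.0)
state.remove_request("b")    # "a" and "c" keep their rows; nothing is reordered
state.add_request("d", 0.3)  # reuses the freed row
step_input = state.gather(["c", "d", "a"])
```

Because the per-step tensor is gathered rather than used in place, the ordering demanded by the step never forces rows in the persistent state to move.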
## 2. Async-First
vLLM now relies heavily on asynchronous scheduling. The scheduler and worker prepare inputs for step `N+1` while the GPU executes step `N`, overlapping CPU and GPU work to maximize utilization.
V1 was not originally designed with async scheduling in mind, and support required retrofitted behavior and hacks. MRV2 instead assumes the core model execution loop is a CUDA stream with no CPU synchronization points. CPU entrypoints queue work onto the stream.
![Async execution timeline](../assets/design/model_runner_v2/async_sched.png)
## 3. Removing Async Barrier
A key requirement for async execution is that CPU operations remain non-blocking. Both explicit sync (for example, `torch.accelerator.synchronize`) and implicit sync (for example, unpinned `.to("cuda")`) must be avoided.
However, async execution can introduce race conditions when CPU and GPU concurrently touch the same memory.
Example (unsafe):
```python
import torch

class ModelRunner:
    def __init__(self, max_num_reqs: int):
        # Pinned buffer, reused every step
        self.states = torch.zeros(
            max_num_reqs, dtype=torch.int32, device="cpu", pin_memory=True
        )

    def execute_step(self, req_idx: int, new_req):
        self.states[req_idx] = new_req.data
        # Async copy reads directly from self.states
        states = self.states.to("cuda", non_blocking=True)
```
The CPU may modify `self.states` on the next step while the GPU is still reading from it via the async copy.
V1 addresses this with an async barrier around critical sections. That avoids races but has drawbacks:
1. Easy to miss protected buffers (bug-prone).
2. Inflexible organization (all CPU work must stay inside barrier).
3. Potentially less overlap due to synchronization.
![Race condition with shared CPU buffer](../assets/design/model_runner_v2/async_race_condition.png)
### MRV2's Solution: Eliminate the Race
MRV2 separates persistent CPU state from the copied tensor:
```python
import torch

class ModelRunner:
    def __init__(self, max_num_reqs: int):
        # Not pinned
        self.states = torch.zeros(
            max_num_reqs, dtype=torch.int32, device="cpu", pin_memory=False
        )

    def execute_step(self, req_idx: int, new_req):
        self.states[req_idx] = new_req.data
        # pin_memory() returns a fresh pinned copy of the unpinned tensor
        tmp_states = self.states.pin_memory()
        states = tmp_states.to("cuda", non_blocking=True)
```
Now the CPU writes to `self.states` while the GPU reads from `tmp_states`, eliminating the race without explicit synchronization.
![No race with temporary pinned copy](../assets/design/model_runner_v2/async_no_race_condition.png)
## 4. StagedWriteTensor
For large tensors like block tables, MRV2 avoids full CPU-to-GPU copies each step by using `StagedWriteTensor`:
1. Keep the base tensor on GPU.
2. Stage diffs on CPU.
3. Pack diffs into contiguous buffers.
4. Copy packed diffs to GPU.
5. Launch one kernel to apply diffs.
Example usage:
```python
# Initialize state on GPU
state = StagedWriteTensor(size=(1024, 1000), dtype=torch.int32, device="cuda")
# Write [3, 1, 2] into row 2, starting at index 3
state.stage_write(row=2, start=3, value=[3, 1, 2])
# Write [-1, -2, -5] into row 0, starting at index 1
state.stage_write(row=0, start=1, value=[-1, -2, -5])
# Apply staged changes
state.apply_write()
```
This supports ragged updates with no CPU-GPU synchronization and minimal kernel launches. It is especially useful for block tables and mixed CPU/GPU-written states such as `num_computed_tokens`.
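
The stage, pack, and apply steps can be modeled in plain Python. The sketch below is a CPU-only stand-in and does not reproduce the real `StagedWriteTensor`: a nested list plays the role of the GPU tensor, the final loop models what a single apply kernel would do, and the class name `StagedWriteSketch` and its packed-buffer layout are assumptions made for illustration.

```python
class StagedWriteSketch:
    """Pure-Python stand-in: `base` plays the GPU tensor, staging lists the CPU side."""

    def __init__(self, num_rows, num_cols, fill=0):
        self.base = [[fill] * num_cols for _ in range(num_rows)]
        self._staged = []  # (row, start, values) recorded on the CPU side

    def stage_write(self, row, start, value):
        self._staged.append((row, start, list(value)))

    def apply_write(self):
        # Pack ragged diffs into contiguous buffers: one flat value buffer plus
        # per-write row/start metadata and cumulative offsets into the values.
        rows, starts, offsets, packed = [], [], [0], []
        for row, start, values in self._staged:
            rows.append(row)
            starts.append(start)
            packed.extend(values)
            offsets.append(len(packed))
        # A real implementation would copy the packed buffers to the GPU here
        # and launch one kernel; this loop models the work of that kernel.
        for i, (row, start) in enumerate(zip(rows, starts)):
            chunk = packed[offsets[i]:offsets[i + 1]]
            self.base[row][start:start + len(chunk)] = chunk
        self._staged.clear()

state = StagedWriteSketch(num_rows=4, num_cols=8)
state.stage_write(row=2, start=3, value=[3, 1, 2])
state.stage_write(row=0, start=1, value=[-1, -2, -5])
state.apply_write()
```

Packing every write into a few flat buffers is what keeps the transfer and launch count constant regardless of how many rows changed.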
## 5. GPU-Native Input Metadata Preparation and Output Processing
MRV2 uses Triton kernels to prepare inputs such as `input_ids`, `positions`, `query_start_loc`, and `seq_lens`.
Benefits:
1. Better async behavior: GPU can derive values (for example with speculative decoding) that CPU may not know yet.
2. Lower CPU overhead: input prep is very cheap on GPU and avoids Python bottlenecks.
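
To make the prepared quantities concrete, the sketch below derives `query_start_loc`, `seq_lens`, and `positions` from two per-request counters in plain Python. It shows only the arithmetic; MRV2 computes these on the GPU in Triton, and the function name `prepare_inputs` and its input names are illustrative.

```python
from itertools import accumulate

def prepare_inputs(num_computed_tokens, num_scheduled_tokens):
    """Derive per-step metadata for a batch from two per-request counters."""
    # Cumulative offsets of each request's query tokens in the flattened batch.
    query_start_loc = [0] + list(accumulate(num_scheduled_tokens))
    # Total context length of each request after this step.
    seq_lens = [c + s for c, s in zip(num_computed_tokens, num_scheduled_tokens)]
    # Absolute position of every scheduled token, flattened across requests.
    positions = [
        c + i
        for c, s in zip(num_computed_tokens, num_scheduled_tokens)
        for i in range(s)
    ]
    return query_start_loc, seq_lens, positions

# A 3-token prefill chunk, a decode with 10 cached tokens, a decode with 5.
query_start_loc, seq_lens, positions = prepare_inputs([0, 10, 5], [3, 1, 1])
```

Every output is a simple function of per-request counters, which is why the computation maps well onto a GPU kernel that reads counters the CPU may not have seen yet.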
### Universal Virtual Addressing (UVA)
MRV2 uses UVA in some paths to let GPU kernels access large CPU-resident tensors directly (for example `prefill_token_ids`) without duplicating those tensors into GPU memory.
## 6. Triton-Native Sampler
MRV2 reimplements sampling mostly in Triton for better numeric/memory control and optimization.
### Gumbel Sampling Kernel
MRV2 introduces a Triton Gumbel sampling kernel that avoids explicit softmax materialization and uses stateless in-kernel RNG from seed input.
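
The underlying Gumbel-max trick can be shown in a few lines of plain Python: adding independent Gumbel noise to the temperature-scaled logits and taking the argmax yields a sample from the softmax distribution, so no probabilities need to be computed. This is a sketch of the technique, not the Triton kernel; `random.Random(seed)` stands in for the kernel's seed-derived RNG.

```python
import math
import random

def gumbel_sample(logits, temperature, seed):
    """Draw one token id via the Gumbel-max trick, with no softmax over the vocabulary."""
    rng = random.Random(seed)  # all randomness derives from the seed; no global state
    best_id, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        score = logit / temperature + gumbel
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id

logits = [2.0, 1.0, 0.0]
token = gumbel_sample(logits, temperature=1.0, seed=1234)
```

The same seed always yields the same token, which makes a request's sampling reproducible without carrying generator state between steps.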
### Efficient Top-K Logprobs
V1 materializes full-vocabulary logprobs before top-k. MRV2 identifies top-k tokens from logits first, then computes logprobs only for selected tokens. This reduces peak GPU memory usage.
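
The reordering works because a token's logprob is its logit minus a single normalizer shared by the whole vocabulary. A minimal plain-Python sketch follows (the function name is illustrative, and the real implementation runs on the GPU):

```python
import heapq
import math

def topk_logprobs(logits, k):
    """Return [(token_id, logprob)] for the k largest logits."""
    top_ids = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    # One streaming reduction over the vocabulary yields the normalizer.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # Only k logprobs are ever stored, not a vocabulary-sized tensor.
    return [(i, logits[i] - log_z) for i in top_ids]

logits = [2.0, -1.0, 0.5, 3.0, 0.0]
top = topk_logprobs(logits, k=2)
```

Since log-softmax is monotonic in the logits, selecting the top-k by logit gives the same tokens as selecting by logprob.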
### Memory-Efficient Prompt Logprobs
MRV2 supports finer-grained chunking, including chunking inside a single prompt, to avoid memory spikes on long prompts.
### Better Compatibility with Speculative Decoding
Instead of expanding per-request sampling states to match per-logit shapes, MRV2 uses indirection (`idx_mapping`) inside kernels to map each logits vector to the right request state. This simplifies support for complex sampling parameters and logits processors.
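
The indirection can be illustrated with temperature scaling in plain Python. In this sketch, `idx_mapping[i]` gives the request that logits row `i` belongs to, so a request with several rows (for example, one per draft token) shares a single state entry; the function name and the data are made up for the example.

```python
def apply_temperature(logits_rows, idx_mapping, temperature):
    """Scale each logits row by the temperature of the request it belongs to."""
    out = []
    for row, req_idx in zip(logits_rows, idx_mapping):
        t = temperature[req_idx]  # indirect lookup; state is never expanded
        out.append([x / t for x in row])
    return out

# Per-request state for three requests.
temperature = [1.0, 0.5, 2.0]
# Five logits rows: request 0 has one, request 1 has three, request 2 has one.
idx_mapping = [0, 1, 1, 1, 2]
logits_rows = [[1.0, 2.0] for _ in range(5)]
scaled = apply_temperature(logits_rows, idx_mapping, temperature)
```

The per-request state keeps its original shape, and only the small integer mapping grows with the number of logits rows.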
## 7. Modularity
MRV2 emphasizes modularity. Compared to V1's large, entangled `gpu_model_runner.py`, MRV2 splits feature logic across dedicated files (for example, `mrope_utils.py`, `penalties.py`, and many others).
It also consolidates model inputs into an `InputBatch` class and reduces direct model-runner attribute coupling.
## 8. No Abuse of `dummy_run`
In V1, `dummy_run` handled too many responsibilities:
- Initial memory profiling and `torch.compile`
- CUDA graph capture
- Warmups
- Empty DP forward passes for EP+DP
MRV2 simplifies this:
1. `execute_model` supports dummy runs without affecting state.
2. `dummy_run` delegates to `execute_model` for profiling, warmup, and empty DP forward passes.
3. CUDA graph capture uses a separate dedicated path.
This reduces complexity and removes bugs caused by divergence between `execute_model` and `dummy_run` behavior.
## 9. Explicit CUDA Graph Management
V1's CUDA graph handling is implicit and hard to reason about. MRV2 uses a `CUDAGraphManager` that explicitly captures and launches full CUDA graphs through standard PyTorch APIs.
This makes graph lifecycle and execution mode decisions more understandable and easier to extend. Example: MRV2 can capture multiple draft-model forward passes into one CUDA graph.
## Development Philosophy
MRV2 changes should meet a higher code quality bar. As feature gaps with V1 are filled, features should be reconsidered from first principles in the MRV2 design context instead of quickly porting V1 behavior.
A key requirement is preserving modularity and clean abstraction boundaries, even if that requires more upfront design iteration.
...@@ -4,17 +4,17 @@ The purpose of this document is to provide an overview of the various MoE kernel ...@@ -4,17 +4,17 @@ The purpose of this document is to provide an overview of the various MoE kernel
## Fused MoE Modular All2All backends ## Fused MoE Modular All2All backends
There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalize` subclasses provide an interface for each all2all backend. There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalizeModular` subclasses provide an interface for each all2all backend.
The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes and async support. The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes and async support.
The output activation format (standard or batched) corresponds to the output of the prepare step of the `FusedMoEPrepareAndFinalize` subclass, and the finalize step requires the same format. All the backend `prepare` methods expect activations in the standard format and all the `finalize` methods return activations in standard format. More details on the formats can be found in the [Fused MoE Modular Kernel](./fused_moe_modular_kernel.md) document. The output activation format (standard or batched) corresponds to the output of the prepare step of the `FusedMoEPrepareAndFinalizeModular` subclass, and the finalize step requires the same format. All the backend `prepare` methods expect activations in the standard format and all the `finalize` methods return activations in standard format. More details on the formats can be found in the [Fused MoE Modular Kernel](./fused_moe_modular_kernel.md) document.
The quantization types and formats enumerate which quantization schemes are supported by each `FusedMoEPrepareAndFinalize` class. The quantization can happen before or after the dispatch based on the format the all2all backend supports, e.g. deepep_high_throughput supports only block-quantized fp8 format. Any other format will result in dispatching in higher precision and quantizing afterwards. The output of the prepare step for each backend is the quantized type. The finalize step generally requires the same input type as the original activations, e.g. if the original input is bfloat16 and the quantization scheme is fp8 with per-tensor scales, `prepare` will return fp8/per-tensor scale activations and `finalize` will take bfloat16 activations. See the diagrams in [Fused MoE Modular Kernel](./fused_moe_modular_kernel.md) for more details on the types and formats of activations at each step of the MoE process. If no quantization type is specified, the kernel operates on float16 and/or bfloat16. The quantization types and formats enumerate which quantization schemes are supported by each `FusedMoEPrepareAndFinalizeModular` class. The quantization can happen before or after the dispatch based on the format the all2all backend supports, e.g. deepep_high_throughput supports only block-quantized fp8 format. Any other format will result in dispatching in higher precision and quantizing afterwards. The output of the prepare step for each backend is the quantized type. The finalize step generally requires the same input type as the original activations, e.g. if the original input is bfloat16 and the quantization scheme is fp8 with per-tensor scales, `prepare` will return fp8/per-tensor scale activations and `finalize` will take bfloat16 activations. See the diagrams in [Fused MoE Modular Kernel](./fused_moe_modular_kernel.md) for more details on the types and formats of activations at each step of the MoE process. 
If no quantization type is specified, the kernel operates on float16 and/or bfloat16.
Async backends support the use of DBO (Dual Batch Overlap) and shared expert overlap (where shared experts are computed during the combine step). Async backends support the use of DBO (Dual Batch Overlap) and shared expert overlap (where shared experts are computed during the combine step).
Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. Llama. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass. For non-modular kernels, it is up to the experts function to deal with this flag. Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. Llama. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalizeModular` subclass. For non-modular kernels, it is up to the experts function to deal with this flag.
Unless otherwise specified, backends are controlled via the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP. Unless otherwise specified, backends are controlled via the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP.
...@@ -31,15 +31,12 @@ th { ...@@ -31,15 +31,12 @@ th {
</style> </style>
| Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Subclass | | Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Subclass |
|---------|--------------------|--------------|---------------|-------|-----------------------|-----------| | ------- | ------------------ | ------------ | ------------- | ----- | --------------------- | --------- |
| naive | standard | all<sup>1</sup> | G,A,T | N | <sup>6</sup> | [layer.py][vllm.model_executor.layers.fused_moe.layer.FusedMoE.forward_impl] | | naive | standard | all<sup>1</sup> | G,A,T | N | <sup>6</sup> | [layer.py][vllm.model_executor.layers.fused_moe.layer.FusedMoE] |
| pplx | batched | fp8,int8 | G,A,T | Y | Y | [`PplxPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.pplx_prepare_finalize.PplxPrepareAndFinalize] | | deepep_high_throughput | standard | fp8 | G(128),A,T<sup>2</sup> | Y | Y | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize] |
| deepep_high_throughput | standard | fp8 | G(128),A,T<sup>2</sup> | Y | Y | [`DeepEPLLPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ll_prepare_finalize.DeepEPLLPrepareAndFinalize] | | deepep_low_latency | batched | fp8 | G(128),A,T<sup>3</sup> | Y | Y | [`DeepEPLLPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ll_prepare_finalize.DeepEPLLPrepareAndFinalize] |
| deepep_low_latency | batched | fp8 | G(128),A,T<sup>3</sup> | Y | Y | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize] | | flashinfer_nvlink_two_sided | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferNVLinkTwoSidedPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_nvlink_two_sided_prepare_finalize.FlashInferNVLinkTwoSidedPrepareAndFinalize] |
| flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferAllToAllMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferAllToAllMoEPrepareAndFinalize] | | flashinfer_nvlink_one_sided | standard | nvfp4 | G,A,T | N | N | [`FlashInferNVLinkOneSidedPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_nvlink_one_sided_prepare_finalize.FlashInferNVLinkOneSidedPrepareAndFinalize] |
| flashinfer<sup>4</sup> | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize] |
| MoEPrepareAndFinalizeNoEP<sup>5</sup> | standard | fp8,int8 | G,A,T | N | Y | [`MoEPrepareAndFinalizeNoEP`][vllm.model_executor.layers.fused_moe.prepare_finalize.MoEPrepareAndFinalizeNoEP] |
| BatchedPrepareAndFinalize<sup>5</sup> | batched | fp8,int8 | G,A,T | N | Y | [`BatchedPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedPrepareAndFinalize] |
!!! info "Table key"
    1. All types: mxfp4, nvfp4, int4, int8, fp8

@@ -69,7 +66,7 @@ Modular kernels are supported by the following `FusedMoEMethodBase` classes.

There are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts] function. Many have modular kernel adapters, so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.

Each kernel must be provided with one of the supported input activation formats. Some flavors of kernels support both standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed for matching with certain all2all backends, e.g. `DeepEPLLPrepareAndFinalize`.

Similar to the backend kernels, each experts kernel only supports certain quantization formats. For non-modular experts, the activations will be in the original type and quantized internally by the kernel. Modular experts will expect the activations to already be in the quantized format. Both types of experts will yield outputs in the original activation type.
@@ -77,12 +74,12 @@ Each experts kernel supports one or more activation functions, e.g. silu or gelu

As with the backends, some experts support applying topk weights on the input activations. The entries in the column in this table only apply to the non-modular experts.

Most experts flavors include an equivalent modular interface which will be a subclass of `FusedMoEExpertsModular`.

To be used with a particular `FusedMoEPrepareAndFinalizeModular` subclass, MoE kernels must have compatible activation formats, quantization types and quantization formats.
| Kernel | Input act. format | Quant. types | Quant. format | Activation function | Apply Weight On Input | Modular | Source |
| ------ | ----------------- | ------------ | ------------- | ------------------- | --------------------- | ------- | ------ |
| triton | standard | all<sup>1</sup> | G,A,T | silu, gelu,</br>swigluoai,</br>silu_no_mul,</br>gelu_no_mul | Y | Y | [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts],</br>[`TritonExperts`][vllm.model_executor.layers.fused_moe.fused_moe.TritonExperts] |
| triton (batched) | batched | all<sup>1</sup> | G,A,T | silu, gelu | <sup>6</sup> | Y | [`BatchedTritonExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedTritonExperts] |
| deep gemm | standard,</br>batched | fp8 | G(128),A,T | silu, gelu | <sup>6</sup> | Y | </br>[`DeepGemmExperts`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.DeepGemmExperts],</br>[`BatchedDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe.BatchedDeepGemmExperts] |
@@ -108,8 +105,8 @@ To be used with a particular `FusedMoEPrepareAndFinalize` subclass, MoE kernels

The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend will work with any non-modular experts.

| backend | `FusedMoEPrepareAndFinalizeModular` subclasses | `FusedMoEExpertsModular` subclasses |
| ------- | ---------------------------------------------- | ----------------------------------- |
| deepep_high_throughput | `DeepEPHTPrepareAndFinalize` | `DeepGemmExperts`,</br>`TritonExperts`,</br>`TritonOrDeepGemmExperts`,</br>`CutlassExpertsFp8`, </br>`MarlinExperts` |
| deepep_low_latency | `DeepEPLLPrepareAndFinalize` | `BatchedDeepGemmExperts`,</br>`BatchedTritonExperts`,</br>`CutlassBatchedExpertsFp8`,</br>`BatchedMarlinExperts` |
| flashinfer | `FlashInferCutlassMoEPrepareAndFinalize` | `FlashInferExperts` |
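
The compatibility rule described above (matching activation formats, quantization types, and quantization formats) can be sketched as a simple set-intersection check. This is only an illustration: `KernelTraits` and `is_compatible` are hypothetical names, not vLLM APIs, the trait values are transcribed by hand from the tables in this document, and the `G(128)` block-size constraint is deliberately ignored.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical constants transcribed from the table key and columns above.
ALL_TYPES = frozenset({"mxfp4", "nvfp4", "int4", "int8", "fp8"})
GAT = frozenset({"G", "A", "T"})


@dataclass(frozen=True)
class KernelTraits:
    """Hypothetical summary of one table row (not a vLLM class)."""

    name: str
    activation_formats: frozenset[str]  # "standard" and/or "batched"
    quant_types: frozenset[str]
    quant_formats: frozenset[str]  # G(128) is simplified to plain "G" here


def is_compatible(backend: KernelTraits, experts: KernelTraits) -> bool:
    """Return True if the two kernels share at least one value per trait."""
    return all(
        getattr(backend, field) & getattr(experts, field)
        for field in ("activation_formats", "quant_types", "quant_formats")
    )


deepep_ll = KernelTraits(
    "DeepEPLLPrepareAndFinalize", frozenset({"batched"}), frozenset({"fp8"}), GAT
)
batched_triton = KernelTraits(
    "BatchedTritonExperts", frozenset({"batched"}), ALL_TYPES, GAT
)
triton = KernelTraits("TritonExperts", frozenset({"standard"}), ALL_TYPES, GAT)
```

With these rows, `is_compatible(deepep_ll, batched_triton)` is true, while `is_compatible(deepep_ll, triton)` is false because the activation formats (batched versus standard) do not overlap.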
\ No newline at end of file
@@ -12,9 +12,8 @@ page for information on known issues and how to solve them.

The use of Python multiprocessing in vLLM is complicated by:

- vLLM being used as a library, which means it cannot control the code that calls it;
- incompatibilities between certain multiprocessing methods and vLLM dependencies.

This document describes how vLLM deals with these challenges.
@@ -22,11 +21,9 @@ This document describes how vLLM deals with these challenges.

[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include:

- `spawn` - Spawn a new Python process. The default on Windows and macOS.
- `fork` - Use `os.fork()` to fork the Python interpreter. The default on
  Linux for Python versions prior to 3.14.
- `forkserver` - Spawn a server process that will fork a new process on request.
  The default on Linux for Python version 3.14 and newer.
@@ -36,8 +33,8 @@ This document describes how vLLM deals with these challenges.
threads. If you are on macOS, using `fork` may cause the process to crash.

`spawn` is more compatible with dependencies, but can be problematic when vLLM
is used as a library. If the consuming code does not use a `__main__` guard
(`if __name__ == "__main__":`), the code will be inadvertently re-executed when vLLM
spawns a new process. This can lead to infinite recursion, among other problems.
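
The role of the `__main__` guard can be illustrated with the standard library alone. The sketch below is not vLLM code; `build_inputs` and `main` are hypothetical names, and a built-in (`abs`) is used as the worker function to keep the example small.

```python
import multiprocessing as mp


def build_inputs() -> list[int]:
    """Pure helper: harmless if it runs again when a child re-imports this module."""
    return [-1, -2, 3]


def main() -> None:
    # "spawn" starts a fresh interpreter that re-imports this module,
    # so everything at module top level runs again in every child process.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(abs, build_inputs()))  # prints [1, 2, 3]


if __name__ == "__main__":
    # In a spawned child the module is not named "__main__", so main() is skipped.
    # Without this guard, each child would call main() again while being imported.
    main()
```

Run as a script, only the original process enters `main()`. Removing the guard makes every spawned child try to create its own pool during import, which is the re-execution problem described above.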
`forkserver` will spawn a new server process that will fork new processes on
@@ -57,8 +54,7 @@ Multiple vLLM dependencies indicate either a preference or requirement for using

- <https://pytorch.org/docs/stable/multiprocessing.html#sharing-cuda-tensors>
- <https://docs.habana.ai/en/latest/PyTorch/Getting_Started_with_PyTorch_and_Gaudi/Getting_Started_with_PyTorch.html?highlight=multiprocessing#torch-multiprocessing-for-dataloaders>

More precisely, there are known issues with using `fork` after these dependencies have been initialized.
## Current State (v0)

@@ -66,8 +62,8 @@ The environment variable `VLLM_WORKER_MULTIPROC_METHOD` can be used to control w

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/envs.py#L339-L342>

When vLLM owns the main process because it was started with the `vllm` command,
`spawn` is used because it is the most widely compatible method.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/scripts.py#L123-L140>
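
A minimal sketch of this selection policy follows. It rests on two assumptions that the linked sources should be checked against: that an explicitly set `VLLM_WORKER_MULTIPROC_METHOD` wins over everything else, and that `fork` is the fallback when vLLM does not own the main process. `choose_start_method` is a hypothetical helper, not a vLLM function.

```python
import os
from collections.abc import Mapping

ENV_VAR = "VLLM_WORKER_MULTIPROC_METHOD"  # variable name taken from the text above


def choose_start_method(
    owns_main_process: bool, env: Mapping[str, str] = os.environ
) -> str:
    """Pick a multiprocessing start method (hypothetical policy sketch)."""
    explicit = env.get(ENV_VAR)
    if explicit:
        # Assumption: an explicit user setting is always respected.
        return explicit
    # Assumption: spawn when the `vllm` CLI owns the process, fork otherwise.
    return "spawn" if owns_main_process else "fork"
```

For example, `choose_start_method(True, {})` returns `"spawn"`, while `choose_start_method(True, {ENV_VAR: "fork"})` returns `"fork"`.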
@@ -104,8 +100,8 @@ dependencies and code using vLLM as a library.

### Changes Made in v1

There is not an easy solution with Python's `multiprocessing` that will work
everywhere. As a first step, we can get v1 into a state where it makes a
"best effort" choice of multiprocessing method to maximize compatibility.

- Default to `fork`.
- Use `spawn` when we know we control the main process (`vllm` was executed).
@@ -154,8 +150,8 @@ RuntimeError:

### Detect if a `__main__` guard is present

It has been suggested that we could behave better if we could detect whether
code using vLLM as a library has a `__main__` guard in place. This
[post on Stack Overflow](https://stackoverflow.com/questions/77220442/multiprocessing-pool-in-a-python-class-without-name-main-guard)
was from a library author facing the same question.

It is possible to detect whether we are in the original, `__main__` process, or
@@ -192,4 +188,4 @@ that works around these challenges.

2. We can explore other libraries that may better suit our needs. Examples to
   consider:

    - <https://github.com/joblib/loky>
<!-- markdownlint-disable -->
# Optimization Levels

## Overview

vLLM provides 4 optimization levels (`-O0`, `-O1`, `-O2`, `-O3`) that allow users to trade off startup time for performance:

- `-O0`: No optimization. Fastest startup time, but lowest performance.
- `-O1`: Fast optimization. Simple compilation and fast fusions, and PIECEWISE cudagraphs.
- `-O2`: Default optimization. Additional compilation ranges, additional fusions, FULL_AND_PIECEWISE cudagraphs.
- `-O3`: Aggressive optimization. Currently equal to `-O2`, but may include additional time-consuming or experimental optimizations in the future.

All optimization level defaults can be achieved by manually setting the underlying flags.
User-set flags take precedence over optimization level defaults.
## Level Summaries and Usage Examples

```bash
# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1

# Python API usage
from vllm.entrypoints.llm import LLM

llm = LLM(
    model="RedHatAI/Llama-3.2-1B-FP8",
    optimization_level=2 # equivalent to -O2
)
```
### `-O0`: No Optimization

Startup as fast as possible - no autotuning, no compilation, and no cudagraphs.
This level is good for initial phases of development and debugging.

Settings:

- `-cc.cudagraph_mode=NONE`
- `-cc.mode=NONE` (also resulting in `-cc.custom_ops=["none"]`)
- `-cc.pass_config.fuse_...=False` (all fusions disabled)
- `--kernel-config.enable_flashinfer_autotune=False`

### `-O1`: Fast Optimization

Prioritize fast startup, but still enable basic optimizations like compilation and cudagraphs.
This level is a good balance for most development scenarios where you want faster startup but
still make sure your code does not break cudagraphs or compilation.

Settings:

- `-cc.cudagraph_mode=PIECEWISE`
- `-cc.mode=VLLM_COMPILE`
- `--kernel-config.enable_flashinfer_autotune=True`

Fusions:

- `-cc.pass_config.fuse_norm_quant=True`*
- `-cc.pass_config.fuse_act_quant=True`*
- `-cc.pass_config.fuse_act_padding=True`
- `-cc.pass_config.fuse_rope_kvcache=True`† (will be moved to O2)

\* These fusions are only enabled when either op is using a custom kernel, otherwise Inductor fusion is better.</br>
† These fusions are ROCm-only and require AITER.

### `-O2`: Full Optimization (Default)

Prioritize performance at the expense of additional startup time.
This level is recommended for production workloads and is hence the default.
Fusions in this level _may_ take longer due to additional compile ranges.

Settings (on top of `-O1`):

- `-cc.cudagraph_mode=FULL_AND_PIECEWISE`
- `-cc.pass_config.fuse_allreduce_rms=True`

### `-O3`: Aggressive Optimization

This level is currently the same as `-O2`, but may include additional optimizations
in the future that are more time-consuming or experimental.
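
The rule that user-set flags take precedence over level defaults can be sketched as a dictionary merge. This is an illustration only: `LEVEL_DEFAULTS` and `resolve_settings` are hypothetical names, the keys are shortened forms of the flags listed for each level, and vLLM's real configuration objects are not used.

```python
from __future__ import annotations

from typing import Any

# Hypothetical table; values copied from the per-level settings in this document.
LEVEL_DEFAULTS: dict[int, dict[str, Any]] = {
    0: {"mode": "NONE", "cudagraph_mode": "NONE", "enable_flashinfer_autotune": False},
    1: {
        "mode": "VLLM_COMPILE",
        "cudagraph_mode": "PIECEWISE",
        "enable_flashinfer_autotune": True,
    },
}
# -O2 builds on top of -O1; -O3 is currently identical to -O2.
LEVEL_DEFAULTS[2] = {
    **LEVEL_DEFAULTS[1],
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "fuse_allreduce_rms": True,
}
LEVEL_DEFAULTS[3] = dict(LEVEL_DEFAULTS[2])


def resolve_settings(
    level: int = 2, user_flags: dict[str, Any] | None = None
) -> dict[str, Any]:
    """Start from the level defaults, then let explicit user flags override them."""
    settings = dict(LEVEL_DEFAULTS[level])
    settings.update(user_flags or {})
    return settings
```

For instance, `resolve_settings(2, {"cudagraph_mode": "NONE"})` keeps `mode="VLLM_COMPILE"` from the level default but reports `cudagraph_mode="NONE"`, because the explicit flag wins.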
## Troubleshooting

@@ -66,4 +83,4 @@ release. Currently behaves the same O2.

1. **Startup Time Too Long**: Use `-O0` or `-O1` for faster startup
2. **Compilation Errors**: Use `debug_dump_path` for additional debugging information
3. **Performance Issues**: Make sure `-O2` is used for production
\ No newline at end of file
@@ -141,7 +141,7 @@ Every plugin has three parts:
    - triton ops
      Custom way doesn't work for triton ops now.

7. (optional) Implement other pluggable modules, such as lora, graph backend, quantization, mamba attention backend, etc.

## Compatibility Guarantee

@@ -155,3 +155,4 @@ The interface for the model/module may change during vLLM's development. If you

- `use_v1` parameter in `Platform.get_attn_backend_cls` is deprecated. It has been removed in v0.13.0.
- `_Backend` in `vllm.attention` is deprecated. It has been removed in v0.13.0. Please use `vllm.v1.attention.backends.registry.register_backend` to add new attention backend to `AttentionBackendEnum` instead.
- `seed_everything` platform interface is deprecated. It has been removed in v0.16.0. Please use `vllm.utils.torch_utils.set_random_seed` instead.
- `prompt` in `Platform.validate_request` is deprecated. It has been removed in v0.18.0.
@@ -26,7 +26,7 @@ This feature is off by default, but can be enabled by setting `compile_mm_encode

To compile a multimodal component such as an encoder, we follow the same mechanism as the LLM text backbone, with a few additional scaffoldings:

1. The `@support_torch_compile` decorator should include `enable_if=should_torch_compile_mm_encoder`. This will gate the compilation behind our
`compile_mm_encoder` configuration
2. `with set_model_tag("<component_name>", is_encoder=True)` context manager should be used around the nn.Module's instantiation. Since torch.compile
@@ -34,9 +34,6 @@ relies on caching artifacts to reduce start time, we must properly propagate the
with the LLM text-backbone, or other instances of the same artifact (as is the case with vision block). `is_encoder=True` is also needed for encoder
components (see Compile Range Integration).
### CompilationConfig

With the exception of `compile_mm_encoder: true`, the multimodal encoder will inherit from the same compilation config as the text LLM. We may extend
...
...@@ -36,12 +36,12 @@ th:not(:first-child) { ...@@ -36,12 +36,12 @@ th:not(:first-child) {
} }
</style> </style>
| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](speculative_decoding/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| [SD](speculative_decoding/README.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
@@ -59,23 +59,23 @@ th:not(:first-child) {

### Feature x Hardware

| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | Intel GPU |
| ------- | ----- | ------ | ------ | --- | ------ | --- | --- | --------- |
| [CP](../configuration/optimization.md#chunked-prefill) | [](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](speculative_decoding/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/8477) | ✅ | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

!!! note
    For information on feature support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.