Commit c3270a92 authored by 王敏

Merge remote-tracking branch 'origin/v0.15.1-dev' into v0.15.1-dev

parents feced2f1 0b7cc6cf
# <div align="center"><strong>DCU vLLM</strong></div>
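## Introduction to vLLM_dcu
vLLM is a fast, easy-to-use library for LLM inference and serving. It is a high-performance serving framework for large language models and multimodal models, designed to deliver low-latency, high-throughput inference in settings ranging from a single GPU to large distributed clusters. Building on the open-source community version, we have adapted it to the DCU platform and added targeted optimizations.
Its core features include:
- Fast runtime: efficient serving through PagedAttention, with prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- Broad model support: a wide range of language models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (E5-Mistral, GTE, ColBERT), and reward models (Qwen-Math), with easy extension to new models. It is compatible with most Hugging Face models and with the OpenAI API.
- Backbone for reinforcement learning and post-training: vLLM is a proven rollout backend with native reinforcement learning integration, and it has been adopted by well-known post-training frameworks.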
## Supported model architectures
| Architecture | Models | FP16/BF16 | AWQ | GPTQ | Supported since | Optimized |
| :---------------------------------: | :------: | :------: | :------: |:------: | :------: |:------: |
| LlamaForCausalLM | Llama 3.2, Llama 3.1, Llama 3, Llama 2, Llama, Yi, Codellama, DeepSeek-R1-Distill-Llama | Yes | Yes | Yes | v0.5.0, Llama 3.2>=v0.6.2 | Yes |
| Llama4ForConditionalGeneration | Llama 4 | No/Yes | - | - | v0.8.5.post1 | No |
| QWenLMHeadModel | QWen, Qwen-VL | Yes | Yes | Yes | v0.5.0, Qwen-VL>=v0.6.2 | Yes |
| Qwen2ForCausalLM | QWen2, QWen1.5, CodeQwen1.5, DeepSeek-R1-Distill-Qwen, gte_Qwen2-1.5B-instruct | Yes | Yes | Yes | v0.5.0, gte>=v0.7.2 | Yes |
| Qwen3ForCausalLM | QWen3, Qwen3-Embedding, Qwen3-Reranker | Yes | - | - | v0.8.4 | Yes |
| Qwen3MoeForCausalLM | QWen3MoE | Yes | - | - | v0.8.4 | Yes |
| Qwen3NextForCausalLM | QWen3-Next | Yes | - | - | v0.11.0 | Yes |
| ChatGLMModel | glm-4v-9b, chatglm3, chatglm2 | Yes | No | Yes | v0.5.0 | Yes |
| Glm4ForCausalLM | GLM-4-0414 | No/Yes | - | - | v0.8.5.post1 | Yes |
| Glm4MoeForCausalLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-4.5-Air | Yes | - | - | v0.9.2 | Yes |
| Glm4vMoeForConditionalGeneration | GLM-4.5V | Yes | - | - | v0.11.0 | Yes |
| DeepseekForCausalLM | Deepseek | Yes | No | - | v0.5.0 | Yes |
| DeepseekV2ForCausalLM | DeepSeek-V2 | Yes | No | - | v0.6.2 | Yes |
| DeepseekVLV2ForCausalLM | DeepSeek-VL2 | Yes | No | - | v0.7.2 | Yes |
| DeepseekV3ForCausalLM | DeepSeek-V3 | Yes | Yes | - | v0.7.2 | Yes |
| DeepseekV32ForCausalLM | DeepSeek-V3.2 | Yes | Yes | - | v0.11.0 | No |
| GptOssForCausalLM | gpt-oss | Yes | - | - | v0.11.0 | Yes |
| BaiChuanForCausalLM | Baichuan2, Baichuan | Yes | No | No | v0.11.0 | Yes |
| BloomForCausalLM | BLOOM | Yes | No | Yes | v0.5.0 | Yes |
| InternLMForCausalLM | InternLM | Yes | No | - | v0.5.0 | Yes |
| InternLM2ForCausalLM | InternLM2 | Yes | No | - | v0.5.0 | Yes |
| FalconForCausalLM | falcon | Yes | No | Yes | v0.5.0 | Yes |
| TeleChat2ForCausalLM | TeleChat2 | Yes | No | - | v0.7.2 | Yes |
| MiniCPMForCausalLM | MiniCPM | Yes | No | - | v0.5.0 | Yes |
| MiniCPM3ForCausalLM | MiniCPM3 | Yes | No | - | v0.6.2 | Yes |
| MixtralForCausalLM | Mixtral-8x7B, Mixtral-8x7B-Instruct | Yes | No | - | v0.5.0 | Yes |
| Qwen2MoeForCausalLM | Qwen2-57B-A14B, Qwen2-57B-A14B-Instruct | Yes | No | - | v0.5.0 | No |
| LlavaForConditionalGeneration | LLaMA, LLaMA-2, LLaMA-3 | Yes | No | - | v0.6.2 | No |
| Qwen2VLForConditionalGeneration | Qwen2-VL | Yes | No | Yes | v0.6.2 | No |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Yes | No | Yes | v0.7.2 | No |
| Qwen3VLForConditionalGeneration | Qwen3-VL | Yes | No | Yes | v0.11.0 | No |
| Mistral3ForConditionalGeneration | Mistral3 | Yes | No | - | v0.8.5.post1 | No |
| Gemma3ForConditionalGeneration | Gemma 3 | Yes | - | - | v0.8.5.post1 | No |
| MiniCPMV | MiniCPM-V | Yes | No | - | v0.6.2 | No |
| Phi3VForCausalLM | Phi-3.5-vision | Yes | No | - | v0.6.2 | No |
| BertModel | bge-large-zh-v1.5 | Yes | No | - | v0.7.2 | No |
| XLMRobertaModel | bge-m3 | Yes | No | - | v0.7.2 | No |
| XLMRobertaForSequenceClassification | bge-reranker-v2-m3 | Yes | No | - | v0.7.2 | No |
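## Installing from source
Two ways to prepare the environment are provided:
1. Use the 光源 pytorch2.9.0 base image: download the image version that matches your pytorch2.9.0, python, dtk, and operating system.
2. Use an existing python environment: install pytorch2.9.0. The pytorch whl packages are hosted at https://cancon.hpccube.com:65024/4/main/pytorch ; download the pytorch2.9.0 whl package that matches your python and dtk versions. Install it as follows: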
```shell
pip install torch*  # the torch whl package downloaded above
pip install setuptools wheel
```
### Building and installing from source
```shell
git clone http://10.16.6.30/dcutoolkit/deeplearing/vllm.git # switch to the branch you need
```
Install the dependencies:
```shell
pip install -r requirements/rocm.txt
```
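- Two ways to build from source are provided (run them from inside the vllm directory):
```
# 1. Build a whl package and install it
python setup.py bdist_wheel
cd dist
pip install vllm*
# 2. Build and install directly from source
python3 setup.py install  # for debugging, python3 setup.py develop can be used instead
```
To include the git revision in the version number, set the environment variable: `export ADD_GIT_VERSION=1`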
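### Preparing the base runtime environment
1. Use the 光源 pytorch2.9.0 base image described above.
2. Download the dependency packages that match your pytorch2.9.0, python, dtk, and operating system: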
- triton:[https://cancon.hpccube.com:65024/4/main/triton](https://cancon.hpccube.com:65024/4/main/triton/)
- flash_attn: [https://cancon.hpccube.com:65024/4/main/flash_attn](https://cancon.hpccube.com:65024/4/main/flash_attn)
- flash_mla: [https://cancon.hpccube.com:65024/4/main/flash_mla](https://cancon.hpccube.com:65024/4/main/flash_mla)
- lightop: [https://cancon.hpccube.com:65024/4/main/lightop](https://cancon.hpccube.com:65024/4/main/lightop)
- lmslim: [https://cancon.hpccube.com:65024/4/main/lmslim](https://cancon.hpccube.com:65024/4/main/lmslim)
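### Notes
+ If `pip install` downloads are too slow, add a mirror: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`
## Verification
- Run `python -c "import vllm; print(vllm.__version__)"` to query the installed version. The version number tracks the upstream release, for example 0.15.1.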
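## PD disaggregation
Notes on the `kv_connector_extra_config` fields used in the commands of this section (P is the prefill service, D is the decode service):
- `enable_multiple_machines`: whether the deployment spans machines. It must be set on both the P and the D service; if either one spans machines, set it to `true`.
- `enable_asymmetric_p2p`: whether P and D use asymmetric partitioning. Asymmetric partitioning supports MLA models.
- `remote_tp_size`: the tp size of D.
- `remote_pp_size`: the pp size of D.
### Environment variables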
```bash
export NCCL_NCHANNELS_PER_PEER=2
export IP_CONFIG_FILE=/data/ip_config.txt  # the first IP is D's first node, the second IP is D's second node
export NCCL_IB_HCA=mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export VLLM_HOST_IP=10.16.1.76  # IP address; change it to match each node
export NCCL_SOCKET_IFNAME=enp33s0f3u1
export GLOO_SOCKET_IFNAME=enp33s0f3u1
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
```
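## Single-instance, single-node P and D with arbitrary partitioning (requires D's tp >= P's tp)
### Proxy
```bash
# On the P node (node 75 in this example):
cd vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd
python3 disagg_proxy_p2p_nccl_xpyd.py
# Important: if the services are restarted, the proxy must be restarted as well
```
### Launch command for P: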
```bash
vllm serve /module/DeepSeek-R1-W4A8-V2/ --port 20011 --trust-remote-code --dtype bfloat16 --max-model-len 49152 --max-num-batched-tokens 8192 -tp 1 -pp 8 --gpu-memory-utilization 0.9 --max-num-seqs 256 --disable-log-requests --block-size 64 --enforce-eager -q slimquant_w4a8_marlin --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"enable_asymmetric_p2p":true,"remote_tp_size":2,"remote_pp_size":4,"proxy_ip":"10.16.1.75","proxy_port":"30001","http_port":"20011","send_type":"PUT_ASYNC"}}' --kv-cache-dtype fp8_e5m2
```
### Launch command for D:
```bash
vllm serve /module/DeepSeek-R1-W4A8-V2/ --host 0.0.0.0 --port 20009 --trust-remote-code --dtype bfloat16 -q slimquant_w4a8_marlin --max-model-len 16484 -tp 2 -pp 4 --gpu-memory-utilization 0.90 --max-num-seqs 100 --block-size 64 --disable-log-requests --max-num-batched-tokens 16484 --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' --kv-cache-dtype fp8_e5m2 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"1e8","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.16.1.75","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","mem_pool_size_gb":128}}'
```
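The `--kv-transfer-config` argument in the two commands above is a JSON string, which is easy to get wrong when edited by hand. The following is a minimal sketch that assembles it in Python using only the field names and values that appear in those commands. The helper names `kv_transfer_config` and `check_partitioning` are hypothetical and are not part of vLLM.

```python
import json


def check_partitioning(p_tp: int, d_tp: int) -> None:
    """Enforce the constraint stated in the section title: D's tp >= P's tp."""
    if d_tp < p_tp:
        raise ValueError(f"D tp ({d_tp}) must be >= P tp ({p_tp})")


def kv_transfer_config(role, kv_port, http_port, buffer_size,
                       proxy_ip="10.16.1.75", proxy_port=30001, extra=None):
    """Hypothetical helper: build the JSON passed to --kv-transfer-config."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {role}")
    extra_config = {
        "proxy_ip": proxy_ip,
        "proxy_port": str(proxy_port),
        "http_port": str(http_port),
        "send_type": "PUT_ASYNC",
    }
    extra_config.update(extra or {})
    config = {
        "kv_connector": "P2pNcclConnector",
        "kv_role": role,
        "kv_buffer_size": buffer_size,
        "kv_port": str(kv_port),
        "kv_connector_extra_config": extra_config,
    }
    return json.dumps(config, separators=(",", ":"))


# P runs with tp=1, pp=8; D runs with tp=2, pp=4 (values from the commands above).
check_partitioning(p_tp=1, d_tp=2)
producer = kv_transfer_config(
    "kv_producer", kv_port=23001, http_port=20011, buffer_size="1e1",
    extra={"enable_asymmetric_p2p": True, "remote_tp_size": 2, "remote_pp_size": 4},
)
consumer = kv_transfer_config(
    "kv_consumer", kv_port=22001, http_port=20009, buffer_size="1e8",
    extra={"mem_pool_size_gb": 128},
)
print(producer)
print(consumer)
```

Wrap each printed string in single quotes when pasting it into the `vllm serve` command line.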
## P: PP2 TP8, D: TP8
### Proxy
```bash
# On the P node (node 75 in this example):
cd vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd
python3 disagg_proxy_p2p_nccl_xpyd_mult_mac.py # use this on the latest version; older versions do not have this file, so run disagg_proxy_p2p_nccl_xpyd.py instead
```
### Launch command for P:
```bash
# On node 75:
ray start --head --node-ip-address=10.16.1.75 --port=8244 --num-gpus=8 --num-cpus=32
# On node 76:
ray start --address='10.16.1.75:8244' --num-gpus=8 --num-cpus=32
# On node 75, start the service:
vllm serve /module/DeepSeek-R1-W4A8-V2/ --host 0.0.0.0 --port 20005 --trust-remote-code --distributed-executor-backend ray --dtype bfloat16 --max-model-len 32768 -tp 8 -pp 2 --gpu-memory-utilization 0.90 --max-num-seqs 256 --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' --disable-log-requests --block-size 64 --enable-chunked-prefill --max-num-batched-tokens 6144 --no-enable-prefix-caching --enforce-eager --kv-cache-dtype fp8_e5m2 -q slimquant_marlin --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"enable_multiple_machines":true,"enable_asymmetric_p2p":false,"remote_tp_size":8,"remote_pp_size":1,"proxy_ip":"10.16.1.75","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","mem_pool_size_gb":64}}'
```
### Launch command for D:
```bash
vllm serve /module/DeepSeek-R1-W4A8-V2/ --host 0.0.0.0 --port 20009 --trust-remote-code --dtype bfloat16 -q slimquant_w4a8_marlin --max-model-len 16484 -tp 8 --gpu-memory-utilization 0.90 --max-num-seqs 100 --block-size 64 --disable-log-requests --max-num-batched-tokens 16484 --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' --kv-cache-dtype fp8_e5m2 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"1e8","kv_port":"22001","kv_connector_extra_config":{"enable_multiple_machines":true,"proxy_ip":"10.16.1.75","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","mem_pool_size_gb":128}}'
```
## P: PP2 TP8, D: PP2 TP8
### Proxy
```bash
# On the P node (node 75 in this example):
cd vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd
python3 disagg_proxy_p2p_nccl_xpyd_mult_mac.py # use this on the latest version; older versions do not have this file, so run disagg_proxy_p2p_nccl_xpyd.py instead
```
### Launch command for P:
```bash
# On node 75:
ray start --head --node-ip-address=10.16.1.75 --port=8244 --num-gpus=8 --num-cpus=32
# On node 76:
ray start --address='10.16.1.75:8244' --num-gpus=8 --num-cpus=32
# On node 75, start the service (remote_pp_size is D's pp size, which is 2 here):
vllm serve /module/DeepSeek-R1-W4A8-V2/ --host 0.0.0.0 --port 20005 --trust-remote-code --distributed-executor-backend ray --dtype bfloat16 --max-model-len 32768 -tp 8 -pp 2 --gpu-memory-utilization 0.90 --max-num-seqs 256 --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' --disable-log-requests --block-size 64 --enable-chunked-prefill --max-num-batched-tokens 6144 --no-enable-prefix-caching --enforce-eager --kv-cache-dtype fp8_e5m2 -q slimquant_marlin --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"enable_multiple_machines":true,"enable_asymmetric_p2p":false,"remote_tp_size":8,"remote_pp_size":2,"proxy_ip":"10.16.1.75","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","mem_pool_size_gb":64}}'
```
### Launch command for D:
```bash
# On node 77:
ray start --head --node-ip-address=10.16.1.77 --port=9244 --num-gpus=8 --num-cpus=32
# On node 26:
ray start --address='10.16.1.77:9244' --num-gpus=8 --num-cpus=32
# On node 77, start the service:
vllm serve /module/DeepSeek-R1-W4A8-V2/ --host 0.0.0.0 --port 20009 --trust-remote-code --dtype bfloat16 -q slimquant_w4a8_marlin --max-model-len 16484 -tp 8 -pp 2 --gpu-memory-utilization 0.90 --max-num-seqs 100 --block-size 64 --disable-log-requests --max-num-batched-tokens 16484 --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 2}' --kv-cache-dtype fp8_e5m2 --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"1e8","kv_port":"22001","kv_connector_extra_config":{"enable_multiple_machines":true,"proxy_ip":"10.16.1.75","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","mem_pool_size_gb":128}}'
```
## low_latency (using DeepEP)
```bash
export VLLM_MOE_DP_CHUNK_SIZE=128
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_LIGHTOP=1
# deep_ep
export NCCL_NET_GDR_LEVEL=7
export NCCL_SDMA_COPY_ENABLE=0
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export ROCSHMEM_HEAP_SIZE=4000000000
export ROCSHMEM_TOPO_FILE_FORCE=/work/topo.config
export USE_SPE_MQP=1
export ROCSHMEM_SQ_SIZE=1024
export ROCSHMEM_GDA_NUM_QPS_DEFAULT_CTX=256
```
topo.config
```YAML
0000:9f:00.0 mlx5_2 2
0000:57:00.0 mlx5_3 3
0000:5e:00.0 mlx5_4 4
0000:05:00.0 mlx5_5 5
0000:e5:00.0 mlx5_6 6
0000:c1:00.0 mlx5_7 7
0000:cc:00.0 mlx5_8 8
0000:b1:00.0 mlx5_9 9
```
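Before pointing `ROCSHMEM_TOPO_FILE_FORCE` at a hand-written file, it can help to sanity-check it. The sketch below parses the three whitespace-separated columns shown above. The column meanings (PCI address, HCA name, numeric index) are an assumption read off the sample values, and `parse_topo` is a hypothetical helper, not part of rocSHMEM or vLLM.

```python
from typing import List, NamedTuple


class TopoEntry(NamedTuple):
    pci_address: str  # assumed: PCI bus address, e.g. 0000:9f:00.0
    hca_name: str     # assumed: InfiniBand HCA name, e.g. mlx5_2
    index: int        # assumed: numeric index paired with the HCA


def parse_topo(text: str) -> List[TopoEntry]:
    """Parse topo.config-style text and reject malformed or duplicate rows."""
    entries = []
    seen_hcas = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        pci, hca, idx = fields
        if hca in seen_hcas:
            raise ValueError(f"line {lineno}: duplicate HCA {hca}")
        seen_hcas.add(hca)
        entries.append(TopoEntry(pci, hca, int(idx)))
    return entries


sample = """\
0000:9f:00.0 mlx5_2 2
0000:57:00.0 mlx5_3 3
0000:5e:00.0 mlx5_4 4
"""
topo = parse_topo(sample)
for entry in topo:
    print(entry.hca_name, entry.pci_address, entry.index)
```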
Single-node EP8/DP8 deployment example
```bash
vllm serve /models/GLM-5-W8A8 \
--disable-log-requests \
-q slimquant_marlin \
--trust-remote-code \
-dp 8 \
-tp 1 \
--enable-expert-parallel \
--disable-custom-all-reduce \
--dtype bfloat16 \
--enable-chunked-prefill \
--max-model-len 50000 \
--max-num-batched-tokens 128 \
--max-num-seqs 32 \
--enable-prefix-caching \
--block-size 64 \
--gpu-memory-utilization 0.88 \
--kv-cache-dtype fp8_ds_mla \
-cc '{"inductor_compile_config":{"combo_kernels": false}}' \
--speculative_config '{"method":"mtp","num_speculative_tokens":3, "quantization": "slimquant_marlin"}'
```
Two-node EP16/DP16 deployment example
```bash
# node1 (the head node)
vllm serve /models/GLM-5-W8A8 \
--disable-log-requests \
-q slimquant_marlin \
--trust-remote-code \
-dp 16 \
-tp 1 \
--enable-expert-parallel \
--disable-custom-all-reduce \
--dtype bfloat16 \
--enable-chunked-prefill \
--max-model-len 72000 \
--max-num-batched-tokens 128 \
--max-num-seqs 32 \
--enable-prefix-caching \
--block-size 64 \
--gpu-memory-utilization 0.88 \
--data-parallel-size-local 8 \
--data-parallel-address ${node1_ip} \
--data-parallel-rpc-port 1127 \
--data-parallel-start-rank 0 \
--kv-cache-dtype fp8_ds_mla \
-cc '{"inductor_compile_config":{"combo_kernels": false}}' \
--speculative_config '{"method":"mtp","num_speculative_tokens":3, "quantization": "slimquant_marlin"}'
# node2
vllm serve /models/GLM-5-W8A8 \
--disable-log-requests \
-q slimquant_marlin \
--trust-remote-code \
-dp 16 \
-tp 1 \
--enable-expert-parallel \
--disable-custom-all-reduce \
--dtype bfloat16 \
--enable-chunked-prefill \
--max-model-len 72000 \
--max-num-batched-tokens 128 \
--max-num-seqs 32 \
--enable-prefix-caching \
--block-size 64 \
--gpu-memory-utilization 0.88 \
--data-parallel-size-local 8 \
--data-parallel-address ${node1_ip} \
--data-parallel-rpc-port 1127 \
--data-parallel-start-rank 8 \
--kv-cache-dtype fp8_ds_mla \
--headless \
-cc '{"inductor_compile_config":{"combo_kernels": false}}' \
--speculative_config '{"method":"mtp","num_speculative_tokens":3, "quantization": "slimquant_marlin"}'
```
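In the two-node example above, the only per-node differences are `--data-parallel-start-rank` (0 on node1, 8 on node2) and `--headless` on node2. A minimal sketch of how the start ranks follow from `-dp` and `--data-parallel-size-local`, assuming every node hosts the same number of local ranks; `data_parallel_start_ranks` is a hypothetical helper, not a vLLM API.

```python
from typing import List


def data_parallel_start_ranks(dp_size: int, dp_size_local: int) -> List[int]:
    """Return the --data-parallel-start-rank value for each node, in node order."""
    if dp_size_local <= 0 or dp_size % dp_size_local != 0:
        raise ValueError("dp_size must be a positive multiple of dp_size_local")
    num_nodes = dp_size // dp_size_local
    return [node * dp_size_local for node in range(num_nodes)]


# -dp 16 with --data-parallel-size-local 8, as in the example above.
start_ranks = data_parallel_start_ranks(dp_size=16, dp_size_local=8)
for node, rank in enumerate(start_ranks, start=1):
    print(f"node{node}: --data-parallel-start-rank {rank}")
```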
## Known Issue
-
## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
......@@ -25,7 +25,7 @@ fastrlock==0.8.3
# cupy==12.3.0
torch == 2.9.0
-triton == 3.3.0
+triton == 3.3.1
flash_attn == 2.8.3
flash_mla == 1.0.0
lightop == 0.6.0
......
model_name: "Qwen/Qwen3-4B"
accuracy_threshold: 0.78
num_questions: 1319
num_fewshot: 5
server_args: "--kv-cache-dtype turboquant_k3v4_nc --enforce-eager --max-model-len 4096"
model_name: "Qwen/Qwen3-4B"
accuracy_threshold: 0.80
num_questions: 1319
num_fewshot: 5
server_args: "--kv-cache-dtype turboquant_k8v4 --enforce-eager --max-model-len 4096"
model_name: "Qwen/Qwen3-4B"
accuracy_threshold: 0.75
num_questions: 1319
num_fewshot: 5
server_args: "--kv-cache-dtype turboquant_3bit_nc --enforce-eager --max-model-len 4096"
model_name: "Qwen/Qwen3-4B"
accuracy_threshold: 0.80
num_questions: 1319
num_fewshot: 5
server_args: "--kv-cache-dtype turboquant_4bit_nc --enforce-eager --max-model-len 4096"
Qwen3-4B-TQ-k8v4.yaml
Qwen3-4B-TQ-t4nc.yaml
Qwen3-4B-TQ-k3v4nc.yaml
Qwen3-4B-TQ-t3nc.yaml
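Each config above pairs a TurboQuant `--kv-cache-dtype` preset with an `accuracy_threshold` over `num_questions: 1319`. The sketch below turns each threshold into the smallest number of correct answers that would clear it. It assumes a run passes when accuracy is greater than or equal to the threshold, which the configs themselves do not state; `min_correct` is a hypothetical helper.

```python
import math

# (preset, accuracy_threshold) pairs copied from the four configs above.
THRESHOLDS = {
    "turboquant_k3v4_nc": 0.78,
    "turboquant_k8v4": 0.80,
    "turboquant_3bit_nc": 0.75,
    "turboquant_4bit_nc": 0.80,
}
NUM_QUESTIONS = 1319


def min_correct(threshold: float, num_questions: int) -> int:
    """Smallest correct-answer count with correct / num_questions >= threshold."""
    return math.ceil(threshold * num_questions)


for preset, threshold in THRESHOLDS.items():
    print(preset, min_correct(threshold, NUM_QUESTIONS))
```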
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Unit tests for TurboQuant KV-cache quantization.

Run: .venv/bin/python -m pytest tests/quantization/test_turboquant.py -v
"""

import math

import pytest
import torch

from vllm.model_executor.layers.quantization.turboquant.config import (
    TQ_PRESETS,
    TurboQuantConfig,
)
from vllm.utils.math_utils import next_power_of_2

# ============================================================================
# Helpers
# ============================================================================

ALL_PRESETS = list(TQ_PRESETS.keys())


def _assert_strictly_sorted(seq, name="sequence"):
    for i in range(len(seq) - 1):
        assert seq[i] < seq[i + 1], f"{name} not sorted at index {i}"


def _is_power_of_2(n: int) -> bool:
    return n > 0 and next_power_of_2(n) == n


# Expected concrete values for each preset at head_dim=128.
# fmt: off
PRESET_EXPECTED = {
    "turboquant_k8v4": dict(
        key_fp8=True, key_quant_bits=8,
        key_mse_bits=0, value_quant_bits=4,
        mse_bits=4, n_centroids=16, centroid_bits=4,
        norm_correction=False,
        key_packed_size=128, value_packed_size=68,
        slot_size=196, slot_size_aligned=196,
    ),
    "turboquant_4bit_nc": dict(
        key_fp8=False, key_quant_bits=4,
        key_mse_bits=4, value_quant_bits=4,
        mse_bits=4, n_centroids=16, centroid_bits=4,
        norm_correction=True,
        key_packed_size=68, value_packed_size=68,
        slot_size=136, slot_size_aligned=136,
    ),
    "turboquant_k3v4_nc": dict(
        key_fp8=False, key_quant_bits=3,
        key_mse_bits=3, value_quant_bits=4,
        mse_bits=3, n_centroids=8, centroid_bits=3,
        norm_correction=True,
        key_packed_size=52, value_packed_size=68,
        slot_size=120, slot_size_aligned=120,
    ),
    "turboquant_3bit_nc": dict(
        key_fp8=False, key_quant_bits=3,
        key_mse_bits=3, value_quant_bits=3,
        mse_bits=3, n_centroids=8, centroid_bits=3,
        norm_correction=True,
        key_packed_size=52, value_packed_size=52,
        slot_size=104, slot_size_aligned=104,
    ),
}
# fmt: on
# ============================================================================
# Config tests (CPU-only, no dependencies beyond config.py)
# ============================================================================


class TestTurboQuantConfig:
    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_preset_parses(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert isinstance(cfg, TurboQuantConfig)

    def test_invalid_preset_raises(self):
        with pytest.raises(ValueError, match="Unknown TurboQuant"):
            TurboQuantConfig.from_cache_dtype("turboquant_invalid", head_dim=128)

    # ---- Per-preset concrete value checks (table-driven) ----
    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_key_mode(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        exp = PRESET_EXPECTED[preset]
        assert cfg.key_fp8 is exp["key_fp8"]
        assert cfg.key_quant_bits == exp["key_quant_bits"]
        assert cfg.key_mse_bits == exp["key_mse_bits"]

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_value_mode(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        exp = PRESET_EXPECTED[preset]
        assert cfg.value_quant_bits == exp["value_quant_bits"]

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_bits_and_centroids(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        exp = PRESET_EXPECTED[preset]
        assert cfg.mse_bits == exp["mse_bits"]
        assert cfg.n_centroids == exp["n_centroids"]
        assert cfg.centroid_bits == exp["centroid_bits"]

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_norm_correction(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert cfg.norm_correction is PRESET_EXPECTED[preset]["norm_correction"]

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_packed_sizes(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        exp = PRESET_EXPECTED[preset]
        assert cfg.key_packed_size == exp["key_packed_size"]
        assert cfg.value_packed_size == exp["value_packed_size"]
        assert cfg.slot_size == exp["slot_size"]
        assert cfg.slot_size_aligned == exp["slot_size_aligned"]

    # ---- Cross-preset structural invariants ----
    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_slot_equals_key_plus_value(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert cfg.slot_size == cfg.key_packed_size + cfg.value_packed_size

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_padded_slot_is_even(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert cfg.slot_size_aligned >= cfg.slot_size
        assert cfg.slot_size_aligned % 2 == 0, (
            f"slot_size_aligned={cfg.slot_size_aligned} is not even"
        )

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_key_value_packed_sizes_positive(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert cfg.key_packed_size > 0
        assert cfg.value_packed_size > 0

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_n_centroids_is_2_to_mse_bits(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert cfg.n_centroids == 2**cfg.mse_bits

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_centroid_bits_always_positive(self, preset):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        assert cfg.centroid_bits > 0

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    def test_mse_key_or_fp8_exclusive(self, preset):
        """Each preset is either FP8 keys or MSE keys, never both."""
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=128)
        if cfg.key_fp8:
            assert cfg.key_mse_bits == 0
            assert cfg.key_quant_bits == 8
        else:
            assert cfg.key_mse_bits > 0
            assert cfg.key_quant_bits in (3, 4)

    @pytest.mark.parametrize("preset", ALL_PRESETS)
    @pytest.mark.parametrize("head_dim", [64, 96, 128, 256])
    def test_all_presets_all_head_dims(self, preset, head_dim):
        cfg = TurboQuantConfig.from_cache_dtype(preset, head_dim=head_dim)
        assert cfg.head_dim == head_dim
        assert cfg.slot_size == cfg.key_packed_size + cfg.value_packed_size
        assert cfg.slot_size_aligned >= cfg.slot_size
        assert cfg.slot_size_aligned % 2 == 0

    # ---- Boundary skip layers ----
    def test_boundary_skip_layers_basic(self):
        layers = TurboQuantConfig.get_boundary_skip_layers(32)
        assert layers == ["0", "1", "30", "31"]

    def test_boundary_skip_layers_zero(self):
        assert TurboQuantConfig.get_boundary_skip_layers(32, 0) == []

    def test_boundary_skip_layers_small_model(self):
        layers = TurboQuantConfig.get_boundary_skip_layers(4)
        assert layers == ["0", "1", "2", "3"]

    def test_boundary_skip_layers_cap_at_half(self):
        layers = TurboQuantConfig.get_boundary_skip_layers(8, 10)
        assert len(layers) == 8
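
The four boundary-skip tests above pin down a simple rule: take the first `n` and last `n` layer indices, with `n` capped at half the layer count. The sketch below is one function that satisfies those four cases. It is not the `TurboQuantConfig.get_boundary_skip_layers` implementation, and the default `n=2` is inferred from the `32 -> ["0", "1", "30", "31"]` case.

```python
from typing import List


def boundary_skip_layers(num_layers: int, n: int = 2) -> List[str]:
    """Return the first and last n layer indices as strings, in numeric order."""
    n = max(0, min(n, num_layers // 2))  # never take more than half from each end
    first = set(range(n))
    last = set(range(num_layers - n, num_layers))
    return [str(i) for i in sorted(first | last)]


print(boundary_skip_layers(32))     # first two and last two layers
print(boundary_skip_layers(32, 0))  # nothing skipped
print(boundary_skip_layers(4))      # small model: every layer is a boundary layer
print(len(boundary_skip_layers(8, 10)))  # n is capped at 8 // 2
```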
# ============================================================================
# Centroids tests (CPU-only)
# ============================================================================
from vllm.model_executor.layers.quantization.turboquant.centroids import (
    get_centroids,
    solve_lloyd_max,
)


class TestCentroids:
    @pytest.mark.parametrize("bits,expected_n", [(2, 4), (3, 8), (4, 16)])
    def test_centroids_shape(self, bits, expected_n):
        c = get_centroids(128, bits)
        assert c.shape == (expected_n,)

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_centroids_sorted(self, bits):
        _assert_strictly_sorted(get_centroids(128, bits), "centroids")

    def test_centroids_cached(self):
        c1 = get_centroids(128, 3)
        c2 = get_centroids(128, 3)
        assert c1 is c2, "get_centroids should return cached object"

    def test_centroids_different_dims_not_identical(self):
        c64 = get_centroids(64, 3)
        c128 = get_centroids(128, 3)
        assert not torch.equal(c64, c128)

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_centroids_symmetric_around_zero(self, bits):
        """N(0, 1/d) is symmetric, so centroids should be ~symmetric."""
        c = get_centroids(128, bits)
        assert abs(c.mean().item()) < 0.01, "Centroids not centered near 0"
        assert abs(c[0].item() + c[-1].item()) < 0.01

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_centroids_within_4sigma(self, bits):
        """All centroids should be within ~4 sigma of N(0, 1/d)."""
        sigma = math.sqrt(1.0 / 128)
        c = get_centroids(128, bits)
        for i, val in enumerate(c):
            assert abs(val.item()) < 4 * sigma, (
                f"Centroid {i}={val:.6f} outside 4*sigma={4 * sigma:.6f}"
            )


class TestLloydMax:
    @pytest.mark.parametrize("bits,expected_n", [(2, 4), (3, 8), (4, 16)])
    def test_solve_shapes(self, bits, expected_n):
        centroids, boundaries = solve_lloyd_max(128, bits)
        assert centroids.shape == (expected_n,)
        assert boundaries.shape == (expected_n - 1,)

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_centroids_sorted(self, bits):
        centroids, _ = solve_lloyd_max(128, bits)
        _assert_strictly_sorted(centroids, "centroids")

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_boundaries_sorted(self, bits):
        _, boundaries = solve_lloyd_max(128, bits)
        _assert_strictly_sorted(boundaries, "boundaries")

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_boundaries_between_centroids(self, bits):
        """Each boundary must lie between its adjacent centroids."""
        centroids, boundaries = solve_lloyd_max(128, bits)
        for i in range(len(boundaries)):
            assert centroids[i] < boundaries[i] < centroids[i + 1], (
                f"Boundary {i}={boundaries[i]:.6f} not between "
                f"c[{i}]={centroids[i]:.6f} and c[{i + 1}]={centroids[i + 1]:.6f}"
            )

    @pytest.mark.parametrize("bits", [2, 3, 4])
    def test_boundaries_are_midpoints(self, bits):
        """Lloyd-Max boundaries are midpoints of adjacent centroids."""
        centroids, boundaries = solve_lloyd_max(128, bits)
        for i in range(len(boundaries)):
            expected = (centroids[i] + centroids[i + 1]) / 2.0
            assert abs(boundaries[i].item() - expected.item()) < 1e-6

    def test_solve_deterministic(self):
        c1, b1 = solve_lloyd_max(128, 3)
        c2, b2 = solve_lloyd_max(128, 3)
        assert torch.equal(c1, c2)
        assert torch.equal(b1, b2)

    def test_solve_dtype_float32(self):
        centroids, boundaries = solve_lloyd_max(128, 3)
        assert centroids.dtype == torch.float32
        assert boundaries.dtype == torch.float32
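
The properties checked above (sorted centroids, boundaries at midpoints, symmetry around zero) are what the textbook Lloyd-Max iteration produces for a zero-mean Gaussian. Below is a minimal standard-library sketch of that iteration for N(0, 1/d). It is not the `solve_lloyd_max` implementation, which returns torch tensors and may differ in initialization and stopping rule; the name `lloyd_max_gaussian` and the fixed iteration count are assumptions.

```python
import math


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(dim: int, bits: int, iters: int = 200):
    """Lloyd-Max quantizer for N(0, 1/dim): returns (centroids, boundaries)."""
    sigma = math.sqrt(1.0 / dim)
    n = 2 ** bits
    # Start from evenly spaced levels in [-3, 3] (standard-normal units).
    c = [-3.0 + 6.0 * (i + 0.5) / n for i in range(n)]
    for _ in range(iters):
        # Step 1: boundaries are midpoints between adjacent centroids.
        b = [(c[i] + c[i + 1]) / 2.0 for i in range(n - 1)]
        edges = [-math.inf] + b + [math.inf]
        # Step 2: each centroid becomes the conditional mean of its cell,
        # E[X | lo < X < hi] = (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)).
        c = [(_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
             for lo, hi in zip(edges, edges[1:])]
    b = [(c[i] + c[i + 1]) / 2.0 for i in range(n - 1)]
    return [x * sigma for x in c], [x * sigma for x in b]


centroids, boundaries = lloyd_max_gaussian(128, 3)
print(len(centroids), len(boundaries))
```

For `bits=1` the iteration lands on the closed-form answer of plus or minus `sigma * sqrt(2 / pi)`, which is a convenient correctness check.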
# ============================================================================
# Rotation matrix tests (GPU required)
# ============================================================================
CUDA_AVAILABLE = torch.cuda.is_available()

from vllm.model_executor.layers.quantization.turboquant.quantizer import (
    generate_rotation_matrix,
)


@pytest.mark.skipif(not CUDA_AVAILABLE, reason="CUDA not available")
class TestRotationMatrix:
    @pytest.mark.parametrize("dim", [64, 96, 128, 256])
    def test_rotation_matrix_shape_and_orthogonal(self, dim):
        Pi = generate_rotation_matrix(dim, seed=42, device="cuda")
        assert Pi.shape == (dim, dim)
        eye = Pi @ Pi.T
        assert torch.allclose(eye, torch.eye(dim, device="cuda"), atol=1e-5), (
            f"Pi not orthogonal for dim={dim}"
        )

    def test_rotation_matrix_deterministic(self):
        Pi1 = generate_rotation_matrix(128, seed=42)
        Pi2 = generate_rotation_matrix(128, seed=42)
        assert torch.equal(Pi1, Pi2)

    def test_rotation_matrix_different_seeds(self):
        Pi1 = generate_rotation_matrix(128, seed=42)
        Pi2 = generate_rotation_matrix(128, seed=99)
        assert not torch.equal(Pi1, Pi2)

    def test_rotation_matrix_det_is_pm1(self):
        """Orthogonal matrix determinant must be +1 or -1."""
        Pi = generate_rotation_matrix(128, seed=42, device="cuda")
        det = torch.linalg.det(Pi)
        assert abs(abs(det.item()) - 1.0) < 1e-4
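
The packed sizes in `PRESET_EXPECTED` at `head_dim=128` follow a visible arithmetic pattern: FP8 keys take one byte per element, and a b-bit MSE-quantized vector takes `head_dim * b / 8` bytes plus 4. The sketch below reproduces the table from that pattern. The 4-byte overhead is inferred from the expected values only; what those bytes store is not stated in the tests, and `packed_size` is a hypothetical helper rather than the `TurboQuantConfig` code.

```python
import math


def packed_size(head_dim: int, bits: int, fp8: bool = False,
                overhead_bytes: int = 4) -> int:
    """Bytes per head vector under the pattern inferred from PRESET_EXPECTED."""
    if fp8:
        return head_dim  # one byte per element, no extra bytes in the table
    return math.ceil(head_dim * bits / 8) + overhead_bytes


# (key_bits, key_is_fp8, value_bits) for each preset, taken from PRESET_EXPECTED.
PRESETS = {
    "turboquant_k8v4": (8, True, 4),
    "turboquant_4bit_nc": (4, False, 4),
    "turboquant_k3v4_nc": (3, False, 4),
    "turboquant_3bit_nc": (3, False, 3),
}

slot_sizes = {}
for name, (key_bits, key_fp8, value_bits) in PRESETS.items():
    key_bytes = packed_size(128, key_bits, fp8=key_fp8)
    value_bytes = packed_size(128, value_bits)
    slot_sizes[name] = key_bytes + value_bytes  # slot = key + value
    print(name, key_bytes, value_bytes, slot_sizes[name])
```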
......@@ -372,3 +372,26 @@ def test_fp8_reloading(
weight_loader(param, torch.zeros(shape)) # cannot use empty
method.process_weights_after_loading(layer)
@pytest.mark.skipif(
    not is_quant_method_supported("fp8"),
    reason="FP8 is not supported on this GPU type.",
)
def test_kv_cache_dtype_skip_layers(vllm_runner, monkeypatch):
    """Test that kv_cache_dtype_skip_layers skips quantization for specified layers."""
    monkeypatch.setenv("VLLM_ALLOW_INSECURE_SERIALIZATION", "1")
    with vllm_runner(
        "facebook/opt-125m",
        kv_cache_dtype="fp8",
        kv_cache_dtype_skip_layers=["0", "2"],
        enforce_eager=True,
    ) as llm:

        def check_layers(model):
            for i, layer in enumerate(model.model.decoder.layers):
                expected = "auto" if str(i) in ["0", "2"] else "fp8"
                assert layer.self_attn.attn.kv_cache_dtype == expected

        llm.apply_model(check_layers)
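
The test above expects layers listed in `kv_cache_dtype_skip_layers` to fall back to `"auto"` while the rest keep `"fp8"`. A self-contained sketch of that decision follows, including the `"sliding_window"` entry that the `Attention` change in this commit also checks. The helper names and the prefix format `model.decoder.layers.<i>.self_attn.attn` are assumptions for illustration; the real code uses vLLM's `extract_layer_index`.

```python
from typing import List, Optional


def layer_index_from_prefix(prefix: str) -> int:
    """Hypothetical stand-in: pull the single integer component out of a layer prefix."""
    indices = [int(part) for part in prefix.split(".") if part.isdigit()]
    if len(indices) != 1:
        raise ValueError(f"expected exactly one layer index in {prefix!r}")
    return indices[0]


def resolve_kv_cache_dtype(prefix: str, kv_cache_dtype: str,
                           skip_layers: List[str],
                           sliding_window: Optional[int] = None) -> str:
    """Return "auto" for skipped layers, otherwise the configured dtype."""
    # Skip by attention type: any sliding-window layer when requested.
    skip = sliding_window is not None and "sliding_window" in skip_layers
    # Skip by layer index: entries are compared as strings.
    if str(layer_index_from_prefix(prefix)) in skip_layers:
        skip = True
    return "auto" if skip else kv_cache_dtype


resolved = [
    resolve_kv_cache_dtype(f"model.decoder.layers.{i}.self_attn.attn", "fp8", ["0", "2"])
    for i in range(4)
]
print(resolved)
```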
#!/usr/bin/env bash
# num_stages sweep for TQ Triton decode kernel _tq_decode_stage1
# Tests num_stages=1,2,3 for k8v4 (GPU2) and turboquant_4bit_nc (GPU3)
# Usage: bash tools/num_stages_sweep.sh
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
KERNEL_FILE="$SCRIPT_DIR/vllm/v1/attention/ops/triton_turboquant_decode.py"
PYTHON="$SCRIPT_DIR/.venv/bin/python"
MODEL="Qwen/Qwen3-4B"
RESULTS_FILE="$SCRIPT_DIR/tools/num_stages_results.txt"

echo "=== TQ Triton num_stages sweep ===" | tee "$RESULTS_FILE"
echo "Date: $(date)" | tee -a "$RESULTS_FILE"
echo "Kernel: $KERNEL_FILE" | tee -a "$RESULTS_FILE"
echo "" | tee -a "$RESULTS_FILE"

patch_num_stages() {
    local ns=$1
    # Replace num_stages=N in the _tq_decode_stage1 launch (line ~548)
    sed -i "s/^\( num_stages=\)[0-9]\+,$/\1${ns},/" "$KERNEL_FILE"
    echo " patched num_stages=$ns in $KERNEL_FILE"
}

run_bench() {
    local gpu=$1
    local port=$2
    local preset=$3
    local ns=$4
    echo ""
    echo "--- preset=$preset num_stages=$ns GPU=$gpu ---" | tee -a "$RESULTS_FILE"

    # Start server
    echo " Starting server on GPU $gpu port $port..."
    CUDA_VISIBLE_DEVICES=$gpu $PYTHON -m vllm.entrypoints.openai.api_server \
        --model "$MODEL" \
        --kv-cache-dtype "$preset" \
        --port "$port" \
        --max-model-len 32768 \
        --disable-log-requests \
        > /tmp/vllm_gpu${gpu}_ns${ns}.log 2>&1 &
    SERVER_PID=$!

    # Wait for server to be ready
    echo " Waiting for server (pid=$SERVER_PID)..."
    local max_wait=120
    local elapsed=0
    while ! curl -sf "http://localhost:${port}/health" > /dev/null 2>&1; do
        sleep 3
        elapsed=$((elapsed + 3))
        if [ $elapsed -ge $max_wait ]; then
            echo " ERROR: server did not start after ${max_wait}s" | tee -a "$RESULTS_FILE"
            kill $SERVER_PID 2>/dev/null || true
            return 1
        fi
    done
    echo " Server ready after ${elapsed}s"

    # Run benchmark
    echo " Running bench..."
    BENCH_OUT=$(
        $PYTHON -m sglang.bench_serving \
            --backend vllm \
            --port "$port" \
            --model "$MODEL" \
            --dataset-name random \
            --random-input-len 64 \
            --random-output-len 1024 \
            --num-prompts 200 \
            --request-rate inf 2>&1
    )
    echo "$BENCH_OUT" >> "$RESULTS_FILE"

    # Extract key metrics
    THROUGHPUT=$(echo "$BENCH_OUT" | grep -oP 'Output token throughput.*?:\s*\K[\d.]+' | head -1 || echo "N/A")
    MEDIAN_TTFT=$(echo "$BENCH_OUT" | grep -oP 'Median TTFT.*?:\s*\K[\d.]+' | head -1 || echo "N/A")
    echo " output_tok/s=$THROUGHPUT median_ttft_ms=$MEDIAN_TTFT" | tee -a "$RESULTS_FILE"

    # Kill server
    kill $SERVER_PID 2>/dev/null || true
    wait $SERVER_PID 2>/dev/null || true
    sleep 2
}

# ===== Sweep k8v4 on GPU 2 =====
PRESET="turboquant_k8v4"
GPU=2
PORT=8502
echo "### PRESET: $PRESET GPU: $GPU ###" | tee -a "$RESULTS_FILE"
for NS in 1 2 3; do
    patch_num_stages $NS
    run_bench $GPU $PORT "$PRESET" $NS
done
# Restore to 2 (default)
patch_num_stages 2

# ===== Sweep turboquant_4bit_nc on GPU 3 =====
PRESET="turboquant_4bit_nc"
GPU=3
PORT=8503
echo "" | tee -a "$RESULTS_FILE"
echo "### PRESET: $PRESET GPU: $GPU ###" | tee -a "$RESULTS_FILE"
for NS in 1 2 3; do
    patch_num_stages $NS
    run_bench $GPU $PORT "$PRESET" $NS
done
# Restore to 2
patch_num_stages 2

echo "" | tee -a "$RESULTS_FILE"
echo "=== Sweep complete. Results in $RESULTS_FILE ===" | tee -a "$RESULTS_FILE"
#!/usr/bin/env bash
# Run a single benchmark: start server, bench, kill server, print results
# Usage: bash tools/run_single_bench.sh <GPU> <PORT> <PRESET> <NS_LABEL>
# NS_LABEL is just for logging (the kernel file must already be patched)
set -euo pipefail
GPU=$1
PORT=$2
PRESET=$3
NS_LABEL=$4
PYTHON=/home/vibhav.agarwal/vllm-tq/.venv/bin/python
MODEL=Qwen/Qwen3-4B
LOG_DIR=/home/vibhav.agarwal/vllm-tq/tools/sweep_logs
echo ">>> START preset=$PRESET num_stages=$NS_LABEL gpu=$GPU port=$PORT"
# Start server
CUDA_VISIBLE_DEVICES=$GPU $PYTHON -m vllm.entrypoints.openai.api_server \
--model "$MODEL" \
--kv-cache-dtype "$PRESET" \
--port "$PORT" \
--max-model-len 32768 \
--disable-log-stats \
> "$LOG_DIR/gpu${GPU}_${PRESET}_ns${NS_LABEL}.log" 2>&1 &
SERVER_PID=$!
# Wait for server ready (max 150s)
for i in $(seq 1 50); do
if curl -sf "http://localhost:${PORT}/health" > /dev/null 2>&1; then
echo " server ready at t=${i}*3s"
break
fi
sleep 3
if [ $i -eq 50 ]; then
echo "ERROR: server did not start"
kill -9 $SERVER_PID 2>/dev/null || true
exit 1
fi
done
# Run benchmark
BENCH_LOG="$LOG_DIR/bench_${PRESET}_ns${NS_LABEL}.log"
$PYTHON -m sglang.bench_serving \
--backend vllm \
--port "$PORT" \
--model "$MODEL" \
--dataset-name random \
--random-input-len 64 \
--random-output-len 1024 \
--num-prompts 200 \
--request-rate inf \
> "$BENCH_LOG" 2>&1
# Extract metrics
OUT_TPS=$(grep -oP 'Output token throughput.*?:\s*\K[\d.]+' "$BENCH_LOG" | head -1 || echo "N/A")
MEDIAN_ITL=$(grep -oP 'Median ITL.*?:\s*\K[\d.]+' "$BENCH_LOG" | head -1 || echo "N/A")
MEDIAN_TPOT=$(grep -oP 'Median TPOT.*?:\s*\K[\d.]+' "$BENCH_LOG" | head -1 || echo "N/A")
echo " RESULT: output_tok/s=$OUT_TPS median_itl_ms=$MEDIAN_ITL median_tpot_ms=$MEDIAN_TPOT"
# Kill server and release GPU memory
kill -9 $SERVER_PID 2>/dev/null || true
# Also kill EngineCore child processes
pkill -9 -f "VLLM::EngineCore" 2>/dev/null || true
sleep 5
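The three `grep -oP` extractions in `run_single_bench.sh` can also be done in one pass. This is a minimal Python sketch, assuming the benchmark log prints lines shaped like `Output token throughput (tok/s): 1234.56`. The `extract_metrics` helper and the sample log text are hypothetical and not part of this commit.

```python
from __future__ import annotations

import re

# Labels mirror the ones grepped by run_single_bench.sh.
LABELS = {
    "output_tok_s": "Output token throughput",
    "median_itl_ms": "Median ITL",
    "median_tpot_ms": "Median TPOT",
}


def extract_metrics(log_text: str) -> dict[str, str]:
    """Return the first numeric value after each label, or "N/A" if absent."""
    metrics = {}
    for key, label in LABELS.items():
        # Rough equivalent of: grep -oP '<label>.*?:\s*\K[\d.]+' | head -1
        # [ \t]* keeps the match on one line, as grep would.
        match = re.search(re.escape(label) + r".*?:[ \t]*([\d.]+)", log_text)
        metrics[key] = match.group(1) if match else "N/A"
    return metrics


# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = """\
Output token throughput (tok/s):         1234.56
Median TPOT (ms):                        7.89
"""

print(extract_metrics(SAMPLE_LOG))
```

A missing label falls back to `"N/A"`, matching the `|| echo "N/A"` fallback in the shell version.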
......@@ -205,6 +205,31 @@ class Attention(nn.Module, AttentionLayerBase):
cache_config.cache_dtype = "fp8"
cache_config.calculate_kv_scales = False
# Skip quantization for specified layers
if cache_config is not None and cache_config.kv_cache_dtype_skip_layers:
from vllm.model_executor.models.utils import extract_layer_index
skip = False
# Check attention type
if (
sliding_window is not None
and "sliding_window" in cache_config.kv_cache_dtype_skip_layers
):
skip = True
# Check layer index
layer_idx = extract_layer_index(prefix)
if str(layer_idx) in cache_config.kv_cache_dtype_skip_layers:
skip = True
if skip:
kv_cache_dtype = "auto"
calculate_kv_scales = False
logger.info(
"Layer %s: kv_cache_dtype=%s, sliding_window=%s",
prefix,
kv_cache_dtype,
sliding_window,
)
self.kv_cache_torch_dtype = kv_cache_dtype_str_to_dtype(
kv_cache_dtype, vllm_config.model_config
)
......@@ -326,6 +351,10 @@ class Attention(nn.Module, AttentionLayerBase):
# Initialize KV cache quantization attributes
_init_kv_cache_quant(self, quant_config, prefix)
# Initialize TurboQuant buffers (Pi, S, centroids) if tq cache dtype
if kv_cache_dtype.startswith("turboquant_"):
self._init_turboquant_buffers(kv_cache_dtype, head_size, prefix)
# for attn backends supporting query quantization
self.query_quant = None
# @TODO
......@@ -344,6 +373,42 @@ class Attention(nn.Module, AttentionLayerBase):
else GroupShape.PER_TENSOR,
)
def _init_turboquant_buffers(
self, cache_dtype: str, head_size: int, prefix: str
) -> None:
"""Initialize TurboQuant rotation/projection matrices and centroids."""
from vllm.model_executor.layers.quantization.turboquant.centroids import (
get_centroids,
)
from vllm.model_executor.layers.quantization.turboquant.config import (
TurboQuantConfig,
)
from vllm.model_executor.layers.quantization.turboquant.quantizer import (
generate_wht_signs,
)
tq_config = TurboQuantConfig.from_cache_dtype(cache_dtype, head_size)
# Each layer needs a unique rotation matrix so quantization errors
# don't correlate across layers. Stride must exceed max head_dim to
# ensure non-overlapping RNG streams between adjacent layers.
_TQ_LAYER_SEED_STRIDE = 1337
from vllm.model_executor.models.utils import extract_layer_index
layer_idx = extract_layer_index(prefix)
seed = tq_config.seed + layer_idx * _TQ_LAYER_SEED_STRIDE
self.register_buffer(
"_tq_signs",
generate_wht_signs(head_size, seed=seed),
)
self.register_buffer(
"_tq_centroids",
get_centroids(head_size, tq_config.centroid_bits),
)
self._tq_config = tq_config
def forward(
self,
query: torch.Tensor,
......@@ -499,6 +564,23 @@ class Attention(nn.Module, AttentionLayerBase):
dtype=self.kv_cache_torch_dtype,
sliding_window=self.sliding_window,
)
elif self.kv_cache_dtype.startswith("turboquant_"):
from vllm.model_executor.layers.quantization.turboquant.config import (
TurboQuantConfig,
)
from vllm.v1.kv_cache_interface import TQFullAttentionSpec
tq_config = TurboQuantConfig.from_cache_dtype(
self.kv_cache_dtype, self.head_size
)
return TQFullAttentionSpec(
block_size=block_size,
num_kv_heads=self.num_kv_heads,
head_size=self.head_size,
head_size_v=self.head_size,
dtype=self.kv_cache_torch_dtype,
tq_slot_size=tq_config.slot_size_aligned,
)
else:
return FullAttentionSpec(
block_size=block_size,
......
......@@ -29,6 +29,11 @@ class AttentionConfig:
flash_attn_max_num_splits_for_cuda_graph: int = 32
"""Flash Attention max number splits for cuda graph decode."""
tq_max_kv_splits_for_cuda_graph: int = 32
"""TurboQuant max NUM_KV_SPLITS for cuda graph decode.
Fixes the split count so grid dimensions are constant across captures,
and buffers can be pre-allocated to avoid inflating the memory estimate."""
use_cudnn_prefill: bool = False
"""Whether to use cudnn prefill."""
......
......@@ -31,6 +31,10 @@ CacheDType = Literal[
"fp8_e5m2",
"fp8_inc",
"fp8_ds_mla",
"turboquant_k8v4",
"turboquant_4bit_nc",
"turboquant_k3v4_nc",
"turboquant_3bit_nc",
"int8",
]
MambaDType = Literal["auto", "float32", "float16"]
......@@ -111,6 +115,9 @@ class CacheConfig:
"""This enables dynamic calculation of `k_scale` and `v_scale` when
kv_cache_dtype is fp8. If `False`, the scales will be loaded from the model
checkpoint if available. Otherwise, the scales will default to 1.0."""
kv_cache_dtype_skip_layers: list[str] = field(default_factory=list)
"""Layer patterns to skip KV cache quantization. Accepts layer indices
(e.g., '0', '2', '4') or attention type names (e.g., 'sliding_window')."""
cpu_kvcache_space_bytes: int | None = None
"""(CPU backend only) CPU key-value cache space."""
mamba_page_size_padded: int | None = None
......
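The `kv_cache_dtype_skip_layers` option mixes two kinds of entries: layer indices and attention type names. The sketch below shows how the per-layer decision in `Attention.__init__` appears to combine them. `should_skip_kv_quant` is a hypothetical standalone helper, not a function in vLLM.

```python
from __future__ import annotations


def should_skip_kv_quant(
    layer_idx: int,
    sliding_window: int | None,
    skip_layers: list[str],
) -> bool:
    """Return True if this layer should keep the default ("auto") KV cache dtype."""
    # Attention-type match: every sliding-window layer is skipped when the
    # literal name "sliding_window" is listed.
    if sliding_window is not None and "sliding_window" in skip_layers:
        return True
    # Index match: entries are strings, so compare against str(layer_idx).
    return str(layer_idx) in skip_layers


skip = ["0", "1", "sliding_window"]
print(should_skip_kv_quant(0, None, skip))  # index is listed
print(should_skip_kv_quant(5, 4096, skip))  # sliding-window layer
print(should_skip_kv_quant(5, None, skip))  # neither condition applies
```

Because entries are compared as strings, an index has to be passed as `'0'` rather than `0` for the match to succeed.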
......@@ -556,6 +556,9 @@ class EngineArgs:
attention_backend: AttentionBackendEnum | None = AttentionConfig.backend
calculate_kv_scales: bool = CacheConfig.calculate_kv_scales
kv_cache_dtype_skip_layers: list[str] = get_field(
CacheConfig, "kv_cache_dtype_skip_layers"
)
mamba_cache_dtype: MambaDType = CacheConfig.mamba_cache_dtype
mamba_ssm_cache_dtype: MambaDType = CacheConfig.mamba_ssm_cache_dtype
mamba_block_size: int | None = get_field(CacheConfig, "mamba_block_size")
......@@ -943,6 +946,9 @@ class EngineArgs:
cache_group.add_argument(
"--calculate-kv-scales", **cache_kwargs["calculate_kv_scales"]
)
cache_group.add_argument(
"--kv-cache-dtype-skip-layers", **cache_kwargs["kv_cache_dtype_skip_layers"]
)
cache_group.add_argument(
"--kv-sharing-fast-prefill", **cache_kwargs["kv_sharing_fast_prefill"]
)
......@@ -1432,6 +1438,7 @@ class EngineArgs:
prefix_caching_hash_algo=self.prefix_caching_hash_algo,
cpu_offload_gb=self.cpu_offload_gb,
calculate_kv_scales=self.calculate_kv_scales,
kv_cache_dtype_skip_layers=self.kv_cache_dtype_skip_layers,
kv_sharing_fast_prefill=self.kv_sharing_fast_prefill,
mamba_cache_dtype=self.mamba_cache_dtype,
mamba_ssm_cache_dtype=self.mamba_ssm_cache_dtype,
......@@ -1441,6 +1448,30 @@ class EngineArgs:
kv_offloading_backend=self.kv_offloading_backend,
)
# TurboQuant: auto-skip first/last 2 layers (boundary protection).
# These layers are most sensitive to quantization error.
# Users can add extra layers via --kv-cache-dtype-skip-layers.
# Disabled for hybrid models (attn+mamba) — mixed page sizes break
# the required page size unification.
if (
resolved_cache_dtype.startswith("turboquant_")
and not model_config.is_hybrid
):
from vllm.model_executor.layers.quantization.turboquant.config import (
TurboQuantConfig,
)
num_layers = model_config.hf_text_config.num_hidden_layers
boundary = TurboQuantConfig.get_boundary_skip_layers(num_layers)
existing = set(cache_config.kv_cache_dtype_skip_layers)
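# Numeric indices sort first; names such as "sliding_window" cannot pass through int().
merged = sorted(existing | set(boundary), key=lambda x: (not str(x).isdigit(), int(x) if str(x).isdigit() else 0, str(x)))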
cache_config.kv_cache_dtype_skip_layers = merged
logger.info(
"TQ: skipping layers %s for boundary protection (num_layers=%d)",
merged,
num_layers,
)
ray_runtime_env = None
if is_ray_initialized():
# Ray Serve LLM calls `create_engine_config` in the context
......
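The boundary-protection merge can be illustrated in isolation. `TurboQuantConfig.get_boundary_skip_layers` is not shown in this diff, so `boundary_skip_layers` below is a hypothetical stand-in that assumes it returns the first and last two layer indices as strings, as the comment in the hunk suggests. The skip list may also hold names such as `'sliding_window'`, which a plain `int(...)` sort key cannot order, so the sketch uses a key that handles both kinds of entry.

```python
from __future__ import annotations


def boundary_skip_layers(num_layers: int, n: int = 2) -> list[str]:
    """Hypothetical stand-in: the first and last `n` layer indices, as strings."""
    n = min(n, num_layers)
    indices = set(range(n)) | set(range(num_layers - n, num_layers))
    return [str(i) for i in sorted(indices)]


def merge_skip_layers(user: list[str], boundary: list[str]) -> list[str]:
    """Union of both lists: numeric indices in numeric order, then names."""

    def sort_key(entry: str) -> tuple[bool, int, str]:
        is_name = not entry.isdigit()
        return (is_name, 0 if is_name else int(entry), entry)

    return sorted(set(user) | set(boundary), key=sort_key)


merged = merge_skip_layers(["10", "sliding_window"], boundary_skip_layers(36))
print(merged)
```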
......@@ -357,7 +357,8 @@ def set_forward_context(
forward_start_time = time.perf_counter()
dp_metadata: DPMetadata | None = None
- if vllm_config.parallel_config.data_parallel_size > 1 and (
+ if vllm_config.parallel_config.data_parallel_size > 1 and \
+ envs.VLLM_ALL2ALL_BACKEND != "deepep_low_latency" and (
attn_metadata is not None or num_tokens is not None
):
# If num_tokens_across_dp hasn't already been initialized, then
......
......@@ -3,6 +3,9 @@
import torch
import torch.nn.functional as F
import triton
import triton.language as tl
import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm.forward_context import get_forward_context, is_forward_context_available
......@@ -33,8 +36,10 @@ from vllm.utils.deep_gemm import (
from vllm.utils.math_utils import cdiv, round_up
from vllm.utils.import_utils import has_deep_gemm
from vllm.model_executor.layers.activation import SiluAndMul
from lightop import fuse_silu_mul_quant_ep
from lmslim.layers.gemm.int8_utils import per_token_quant_int8
if has_deep_gemm():
from deepgemm import m_grouped_w8a8_gemm_nt_masked
else:
......@@ -45,6 +50,161 @@ else:
logger = init_logger(__name__)
# ==============================================
# MoE grouped GEMM Triton kernel (int8 quantization + expert parallelism)
# Input layout after All2All: [E, M, K] / [E, N, K]
# Output: [E, M, N], written directly into the caller-provided output tensor
# ==============================================
@triton.jit
def moe_grouped_gemm_kernel(
# Pointers
A_ptr, B_ptr,
A_scale_ptr, B_scale_ptr,
token_counts_ptr,
output_ptr,
# Strides (E: expert/batch dim, M: token dim, N: output-channel dim, K: feature dim)
stride_A_E, stride_A_M, stride_A_K,
stride_B_E, stride_B_N, stride_B_K,
stride_A_scale_E, stride_A_scale_M,
stride_B_scale_E, stride_B_scale_N,
stride_out_E, stride_out_M, stride_out_N,
# Fixed dimensions
E: tl.constexpr, # total number of experts
M: tl.constexpr, # maximum number of tokens per expert
N: tl.constexpr, # output dimension per expert
K: tl.constexpr, # input feature dimension
# Tile sizes (tunable)
BLOCK_M: tl.constexpr,
BLOCK_N: tl.constexpr,
BLOCK_K: tl.constexpr,
):
# ===================== 1. Expert ID + tile coordinates =====================
# Program IDs map to: expert (E), token tile (M), output tile (N)
pid_e = tl.program_id(0) # expert dimension (0..E-1)
pid_m = tl.program_id(1) # token-tile dimension
pid_n = tl.program_id(2) # output-tile dimension
# Number of tokens this expert actually has to process
token_cnt = tl.load(token_counts_ptr + pid_e)
# Exit early for tiles past the actual token count (the count is dynamic)
if pid_m * BLOCK_M >= token_cnt:
return
# ===================== 2. Memory offsets for the current tile =====================
# Input A [E, M, K]
A_base = A_ptr + pid_e * stride_A_E
# Weight B [E, N, K]
B_base = B_ptr + pid_e * stride_B_E
# Scale
A_scale_base = A_scale_ptr + pid_e * stride_A_scale_E
B_scale_base = B_scale_ptr + pid_e * stride_B_scale_E
# Output [E, M, N]
out_base = output_ptr + pid_e * stride_out_E
# Tile coordinates
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
offs_k = tl.arange(0, BLOCK_K)
# Memory addresses
a_ptrs = A_base + (offs_m[:, None] * stride_A_M + offs_k[None, :] * stride_A_K)
b_ptrs = B_base + (offs_n[:, None] * stride_B_N + offs_k[None, :] * stride_B_K)
# ===================== 3. Initialize the accumulator =====================
acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
# ===================== 4. GEMM loop over the K dimension (int8 matmul) =====================
for k in range(0, K, BLOCK_K):
# Load int8 tiles (kept as int8)
a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k, other=0.0)
b = tl.load(b_ptrs, mask=offs_k[None, :] < K - k, other=0.0)
# Multiply-accumulate
acc += tl.dot(a, tl.trans(b)) # B: [N, K] -> transposed to [K, N]
# Advance pointers
a_ptrs += BLOCK_K * stride_A_K
b_ptrs += BLOCK_K * stride_B_K
# ===================== 5. int8 dequantization (per-token + per-output-channel) =====================
# Load the current expert's scales
a_scale = tl.load(A_scale_base + offs_m * stride_A_scale_M) # [BLOCK_M]
b_scale = tl.load(B_scale_base + offs_n * stride_B_scale_N) # [BLOCK_N]
# Dequantize: out = (int8_mm) * A_scale * B_scale
result = acc * a_scale[:, None] * b_scale[None, :]
# ===================== 6. Write output [E, M, N] =====================
out_ptrs = out_base + (offs_m[:, None] * stride_out_M + offs_n[None, :] * stride_out_N)
# Mask: only write valid tokens and valid output channels
mask_m = offs_m < token_cnt
mask_n = offs_n < N
mask = mask_m[:, None] & mask_n[None, :]
tl.store(out_ptrs, result, mask=mask)
# ==============================================
# Wrapper (public entry point; derives the strides and the launch grid)
# ==============================================
def moe_grouped_gemm(
A: torch.Tensor, # [E, M, K]
B: torch.Tensor, # [E, N, K] int8
A_scale: torch.Tensor, # [E, M, 1]
B_scale: torch.Tensor, # [E, N, 1]
token_counts: torch.Tensor, # [E]
output: torch.Tensor, # [E, M, N] (caller-provided, written in place)
):
# Shape checks
E, M, K = A.shape
_, N, _ = B.shape
assert B.shape == (E, N, K)
assert A_scale.shape == (E, M, 1)
assert B_scale.shape == (E, N, 1)
assert token_counts.shape == (E,)
assert output.shape == (E, M, N)
# All tensors must be on the same device
assert A.device == B.device == A_scale.device == B_scale.device == token_counts.device == output.device
assert A.is_cuda
# Tile sizes (fixed defaults meant to suit mainstream GPUs)
BLOCK_M = 64
BLOCK_N = 64
BLOCK_K = 64
# Launch grid: [E, ceil(M/BLOCK_M), ceil(N/BLOCK_N)]
grid = (
E,
triton.cdiv(M, BLOCK_M),
triton.cdiv(N, BLOCK_N),
)
# Launch the kernel
moe_grouped_gemm_kernel[grid](
# Data pointers
A, B,
A_scale, B_scale,
token_counts,
output,
# Strides (taken from the tensors, which are expected to be contiguous in the last dimension)
stride_A_E=A.stride(0), stride_A_M=A.stride(1), stride_A_K=A.stride(2),
stride_B_E=B.stride(0), stride_B_N=B.stride(1), stride_B_K=B.stride(2),
stride_A_scale_E=A_scale.stride(0), stride_A_scale_M=A_scale.stride(1),
stride_B_scale_E=B_scale.stride(0), stride_B_scale_N=B_scale.stride(1),
stride_out_E=output.stride(0), stride_out_M=output.stride(1), stride_out_N=output.stride(2),
# Fixed dimensions
E=E, M=M, N=N, K=K,
# Tile sizes
BLOCK_M=BLOCK_M,
BLOCK_N=BLOCK_N,
BLOCK_K=BLOCK_K,
)
return output
def scales_shape_stride_dtype(
E: int, T: int, G: int, quant_scale_fmt: DeepGemmQuantScaleFMT
) -> tuple[tuple[int, ...], tuple[int, ...], torch.dtype]:
......@@ -297,6 +457,7 @@ class BatchedDeepGemmExperts(mk.FusedMoEPermuteExpertsUnpermute):
self.N = N
self.K = K
self.act_fn = SiluAndMul()
@staticmethod
def activation_format() -> mk.FusedMoEActivationFormat:
......@@ -414,7 +575,7 @@ class BatchedDeepGemmExperts(mk.FusedMoEPermuteExpertsUnpermute):
workspace2: torch.Tensor,
expert_tokens_meta: mk.ExpertTokensMetadata | None,
apply_router_weight_on_input: bool,
use_nn_moe: bool | None = False,
**_
):
assert expert_tokens_meta is not None
expert_num_tokens = expert_tokens_meta.expert_num_tokens
......@@ -436,11 +597,13 @@ class BatchedDeepGemmExperts(mk.FusedMoEPermuteExpertsUnpermute):
workspace1 = _resize_cache(workspace13, (E, max_num_tokens, N))
- expected_m = self.estimate_expected_m(
- global_num_experts=global_num_experts,
- max_tokens_per_expert=max_num_tokens,
- topk=topk_ids.size(-1),
- )
+ # expected_m = self.estimate_expected_m(
+ # global_num_experts=global_num_experts,
+ # max_tokens_per_expert=max_num_tokens,
+ # topk=topk_ids.size(-1),
+ # )
+ expected_m = self.get_expected_m()
if self.quant_config.use_fp8_w8a16 or self.quant_config.use_fp8_w8a8:
fp8_m_grouped_gemm_nt_masked(
......
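For checking the `moe_grouped_gemm` wrapper added in this file on small inputs, a slow reference is useful. The pure-Python sketch below implements the math described in the kernel comments: an integer dot product over K, then per-token and per-output-channel scaling. It assumes nested lists instead of tensors and flattens the scales to `[E][M]` and `[E][N]`. `grouped_gemm_reference` is a hypothetical helper, not part of the commit.

```python
def grouped_gemm_reference(A, B, a_scale, b_scale, token_counts):
    """out[e][m][n] = (sum_k A[e][m][k] * B[e][n][k]) * a_scale[e][m] * b_scale[e][n]
    for m < token_counts[e]; rows past the count keep their initial 0.0."""
    E, M, K = len(A), len(A[0]), len(A[0][0])
    N = len(B[0])
    out = [[[0.0] * N for _ in range(M)] for _ in range(E)]
    for e in range(E):
        for m in range(token_counts[e]):  # rows past the count are skipped
            for n in range(N):
                # B is stored as [N, K], so this is A @ B^T for one element.
                acc = sum(A[e][m][k] * B[e][n][k] for k in range(K))
                out[e][m][n] = acc * a_scale[e][m] * b_scale[e][n]
    return out


# One expert, two token slots (only the first is valid), K = N = 2.
A = [[[1, 2], [3, 4]]]
B = [[[1, 0], [0, 1]]]
print(grouped_gemm_reference(A, B, [[0.5, 1.0]], [[2.0, 4.0]], [1]))
```

The Triton kernel leaves rows beyond `token_counts[e]` unwritten rather than zeroed, so a comparison should only cover the valid rows.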
......@@ -297,21 +297,19 @@ class DeepEPLLPrepareAndFinalize(mk.FusedMoEPrepareAndFinalize):
# Dispatch
dispatch_topk_ids = self._map_global_to_physical_ids(topk_ids)
+ quant_type = 0
+ if self.use_int8_dispatch:
+ quant_type = 1
+ elif self.use_fp8_dispatch:
+ quant_type = 2
expert_x, expert_num_tokens, handle, _, hook = self.buffer.low_latency_dispatch(
a1,
dispatch_topk_ids,
self.max_tokens_per_rank,
num_experts,
- use_fp8=self.use_fp8_dispatch or self.use_int8_dispatch,
- use_int8=self.use_int8_dispatch,
- round_scale=self.use_ue8m0_dispatch,
- use_ue8m0=self.use_ue8m0_dispatch,
- **(dict(use_nvfp4=True) if use_nvfp4 else dict()),
- **(
- dict(x_global_scale=qc_a1_gscale_or_scale)
- if qc_a1_gscale_or_scale is not None
- else dict()
- ),
+ quant_type = quant_type,
+ fp8_round_scale=False,
async_finish=False,
return_recv_hook=True,
)
......
......@@ -853,7 +853,7 @@ class FusedMoE(CustomOp):
def use_dp_chunking(self) -> bool:
return (
self.moe_parallel_config.use_pplx_kernels
- or self.moe_parallel_config.use_deepep_ll_kernels
+ #or self.moe_parallel_config.use_deepep_ll_kernels
or self.moe_parallel_config.use_mori_kernels
or (self.dp_size > 1 and self.use_flashinfer_cutlass_kernels)
) and envs.VLLM_ENABLE_MOE_DP_CHUNK
......
......@@ -406,6 +406,7 @@ class FusedMoEPermuteExpertsUnpermute(ABC):
self.quant_config = quant_config
self.max_num_tokens = max_num_tokens
self.num_dispatchers = num_dispatchers
self.expected_m = max_num_tokens
@staticmethod
def expects_unquantized_inputs(
......@@ -775,6 +776,12 @@ class FusedMoEPermuteExpertsUnpermute(ABC):
"""
raise NotImplementedError
def set_expected_m(self, expected_m):
self.expected_m = expected_m
def get_expected_m(self):
return self.expected_m
def _slice_scales(
scales: torch.Tensor | None, start: int, end: int
......@@ -1074,6 +1081,12 @@ class FusedMoEModularKernel(torch.nn.Module):
The _prepare method is a wrapper around self.prepare_finalize.prepare
that handles DBO and async.
"""
expected_m = (
hidden_states.shape[0] * self.fused_experts.num_dispatchers * topk_ids.shape[1]
+ global_num_experts
) // global_num_experts
self.fused_experts.set_expected_m(expected_m)
if not self.prepare_finalize.supports_async():
# We shouldn't be running an a2a kernel that doesn't
# support async prepare/finalize
......
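The `expected_m` estimate computed before `set_expected_m` is plain integer arithmetic, so it can be worked through without a GPU. The sketch below restates the expression as a hypothetical standalone function and uses made-up sizes.

```python
def expected_m(num_tokens: int, num_dispatchers: int, topk: int, num_experts: int) -> int:
    """Per-expert token estimate, mirroring the expression in the hunk above."""
    routed = num_tokens * num_dispatchers * topk
    # Adding num_experts before the floor division yields floor(routed / E) + 1.
    return (routed + num_experts) // num_experts


print(expected_m(8, 4, 2, 16))   # 64 routed tokens over 16 experts
print(expected_m(1, 1, 1, 256))  # a single routed token
```

The result is always at least 1. When the routed token count is an exact multiple of the expert count, it is one higher than a ceiling division `(routed + E - 1) // E` would give.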