Commits · f256ebe4df6757d76f1f1642d7e110268a2f8190 · OpenDAS / vllm_cscc

02 Feb, 2025 1 commit

[Bugfix] fix moe_wna16 get_quant_method (#12648) · baaa2b24

Jinzhen Lin authored Feb 02, 2025

Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, the
GPTQ-based linear method, or the AWQ-based linear method, even when the
target module is an attention layer.

https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92
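
The dispatch problem described above can be sketched as follows. This is a minimal illustration, not vLLM's code: the classes `LinearLayer`, `MoELayer`, and `AttentionLayer` and the returned strings are hypothetical stand-ins, and only the name `get_quant_method` comes from the commit message. Returning `None` for unsupported layers is an assumption about what the fix does.

```python
from typing import Optional


# Hypothetical stand-ins for the real layer classes.
class LinearLayer: ...
class MoELayer: ...
class AttentionLayer: ...


def get_quant_method_buggy(layer: object) -> Optional[str]:
    # Bug pattern: everything that is not MoE is treated as linear,
    # so an attention layer also receives a linear method.
    if isinstance(layer, MoELayer):
        return "moe_method"
    return "linear_method"


def get_quant_method_fixed(layer: object) -> Optional[str]:
    # Assumed fix: check the layer type explicitly and return None
    # for layers this config does not handle (e.g. attention).
    if isinstance(layer, MoELayer):
        return "moe_method"
    if isinstance(layer, LinearLayer):
        return "linear_method"
    return None
```

With the explicit type check, the caller can tell "no quantization method applies" apart from "use the linear method" instead of silently receiving the wrong one.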

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

baaa2b24

01 Feb, 2025 4 commits

Apply torch.compile to fused_moe/grouped_topk (#12637) · 3194039c
Michael Goin authored Feb 01, 2025

3194039c

[Attention] Deepseek v3 MLA support with FP8 compute (#12601) · baeded25

Lucas Wilkinson authored Feb 01, 2025



This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights.

---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

baeded25

Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517) · 3e1c76cf

Rahul Tuli authored Jan 31, 2025

This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
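
A minimal sketch of the intended behaviour, assuming the ignore list holds exact module names. The function `select_scheme` and its parameters are hypothetical; the scheme names are taken from the example output in this commit message, and the real selection logic in vLLM is more involved.

```python
from typing import Iterable


def select_scheme(layer_name: str,
                  sparsity_ignore: Iterable[str],
                  fallback_scheme: str) -> str:
    """Pick a scheme name for one module (illustrative only)."""
    # Modules listed in sparsity_config.ignore are not Sparse24,
    # so they must not be routed to the Cutlass sparse scheme.
    if layer_name in set(sparsity_ignore):
        return fallback_scheme
    return "CompressedTensors24"


# Hypothetical ignore list: the attention projection of layers 6 and 7.
ignore = [f"model.layers.{i}.self_attn.qkv_proj" for i in (6, 7)]
sparse = select_scheme("model.layers.0.self_attn.qkv_proj", ignore,
                       "CompressedTensorsW8A8Fp8")
dense = select_scheme("model.layers.6.self_attn.qkv_proj", ignore,
                      "CompressedTensorsW8A8Fp8")
```

Here `sparse` is `"CompressedTensors24"` and `dense` is the fallback scheme, which mirrors the per-layer split in the example output.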

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` 
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** Assumed all modules in fused layers such as `QKV_proj` and
`Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

3e1c76cf

Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258

Eldar Kurtic authored Feb 01, 2025

Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
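
One way to picture the matching problem: vLLM fuses `q_proj`/`k_proj`/`v_proj` into `qkv_proj` and `gate_proj`/`up_proj` into `gate_up_proj`, while a recipe that lists layers explicitly uses the unfused names. The sketch below is an assumption about the general approach, not the PR's code; `FUSED_TO_SHARDS`, `match_target`, and the sample targets are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical mapping from a fused module name to its unfused shards.
FUSED_TO_SHARDS: Dict[str, List[str]] = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def match_target(layer_name: str, targets: Iterable[str]) -> Optional[str]:
    """Return a matching target for layer_name, or None if there is none."""
    target_set = set(targets)
    # Unfused layers (e.g. down_proj) match by their own name.
    if layer_name in target_set:
        return layer_name
    prefix, _, leaf = layer_name.rpartition(".")
    shards = FUSED_TO_SHARDS.get(leaf)
    if shards is None:
        return None
    shard_names = [f"{prefix}.{s}" if prefix else s for s in shards]
    # A fused layer matches only if every one of its shards is targeted.
    if all(name in target_set for name in shard_names):
        return shard_names[0]
    return None


targets = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.down_proj",
]
matched = match_target("model.layers.0.self_attn.qkv_proj", targets)
```

Requiring all shards to match reflects the assumption that every module inside a fused layer shares one quantization scheme.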

1867c258

31 Jan, 2025 6 commits

[Feature] Fix guided decoding blocking bitmask memcpy (#12563) · fc542144

Ryan Nguyen authored Jan 31, 2025

**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR makes
the copy asynchronous by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
launches the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>

fc542144

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) · eb5741ad

Tyler Michael Smith authored Jan 31, 2025

Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

eb5741ad

[Bugfix] Revert MoE Triton Config Default (#12629) · 145c2ff6

Robert Shaw authored Jan 31, 2025

SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

145c2ff6

[BugFix] Fix Torch.Compile For DeepSeek (#12594) · 325f679f
Robert Shaw authored Jan 31, 2025
```
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
325f679f

[Attention] MLA decode optimizations (#12528) · cabaf4ef

Lucas Wilkinson authored Jan 31, 2025


Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

cabaf4ef

[ROCm][AMD][Model] llama 3.2 support upstreaming (#12421) · a1fc18c0
Aleksandr Malyshev authored Jan 30, 2025
```
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
```
a1fc18c0

30 Jan, 2025 1 commit

[Kernel] Triton Configs for Fp8 Block Quantization (#11589) · 9b0c4bab

Robert Shaw authored Jan 30, 2025


Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

9b0c4bab

29 Jan, 2025 5 commits

[Kernel] add triton fused moe kernel for gptq/awq (#12185) · 27b78c73
Jinzhen Lin authored Jan 29, 2025

27b78c73
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787) · b02fd288
Pavani Majety authored Jan 29, 2025
```
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
b02fd288

[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069) · d93bf4da

Alphi authored Jan 29, 2025


Signed-off-by: hzh <hezhihui_thu@163.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

d93bf4da

[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347) · 036ca94c

Travis Johnson authored Jan 29, 2025


Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Wallas Santos <wallashss@ibm.com>

036ca94c

Bugfix for whisper quantization due to fake k_proj bias (#12524) · bd02164c
Michael Goin authored Jan 28, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
bd02164c

28 Jan, 2025 2 commits
- [VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504) · 8f58a513
  Cyrus Leung authored Jan 29, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  8f58a513
- Update `pre-commit` hooks (#12475) · 823ab796
  Harry Mellor authored Jan 28, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  823ab796
27 Jan, 2025 3 commits
- [FlashInfer] Upgrade to 0.2.0 (#11194) · 2bc3fbba
  Bowen Wang authored Jan 28, 2025
```
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
```
  2bc3fbba
- [Bugfix] Fix gpt2 GGUF inference (#12467) · ce69f7f7
  Isotr0py authored Jan 27, 2025
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
  ce69f7f7
- [Bugfix] Fix Granite 3.0 MoE model loading (#12446) · 5204ff5c
  Cyrus Leung authored Jan 27, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  5204ff5c
26 Jan, 2025 1 commit
- [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417) · aa2cd2c4
  Tyler Michael Smith authored Jan 26, 2025
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  aa2cd2c4
25 Jan, 2025 2 commits
- [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408) · bf21481d
  Divakar Verma authored Jan 24, 2025
```
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
```
  bf21481d
- [Bugfix] Fix BLIP-2 processing (#12412) · fb30ee92
  Cyrus Leung authored Jan 25, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  fb30ee92
24 Jan, 2025 1 commit
- Set weights_only=True when using torch.load() (#12366) · d3d6bb13
  Russell Bryant authored Jan 23, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
  d3d6bb13
23 Jan, 2025 4 commits

[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE... · eb5cb5e5

Dipika Sikka authored Jan 23, 2025


[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order  (#11528)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

eb5cb5e5

[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906) · e97f802b

Gregory Shtrasberg authored Jan 23, 2025


Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

e97f802b

[Bugfix] Fix k_proj's bias for whisper self attention (#12342) · c5b4b11d
Isotr0py authored Jan 23, 2025
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
c5b4b11d
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282) · 68c4421b
rasmith authored Jan 22, 2025
```
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
```
68c4421b

22 Jan, 2025 7 commits
- [Misc] Improve the readability of BNB error messages (#12320) · 84bee4bd
  Jee Jee Li authored Jan 23, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
  84bee4bd
- [Doc] Add docs for prompt replacement (#12318) · 6609cdf0
  Cyrus Leung authored Jan 22, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  6609cdf0
- [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (#12313) · 16366ee8
  Roger Wang authored Jan 22, 2025
```
Signed-off-by: Roger Wang <ywang@roblox.com>
```
  16366ee8
- [Model][Bugfix]: correct Aria model output (#12309) · 528dbcac
  zhou fan authored Jan 22, 2025
```
Signed-off-by: xffxff <1247714429@qq.com>
```
  528dbcac
- [VLM] Avoid unnecessary tokenization (#12310) · cd7b6f08
  Cyrus Leung authored Jan 22, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  cd7b6f08
- [ci/lint] Add back default arg for pre-commit (#12279) · 64ea24d0
  Kevin H. Luu authored Jan 21, 2025
```
Signed-off-by: kevin <kevin@anyscale.com>
```
  64ea24d0
- [VLM] Simplify post-processing of replacement info (#12269) · df76e5af
  Cyrus Leung authored Jan 22, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  df76e5af
21 Jan, 2025 3 commits
- [Misc] Set default backend to SDPA for get_vit_attn_backend (#12235) · fa9ee081
  wangxiyuan authored Jan 22, 2025
```
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
```
  fa9ee081
- [Misc] Remove redundant TypeVar from base model (#12248) · f2e9f2a3
  Cyrus Leung authored Jan 21, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  f2e9f2a3
- [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#12237) · 1f1542af
  Jee Jee Li authored Jan 21, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
  1f1542af