Commits · afd0da2186c1d58fb48e138df0a2f548612b5d7d · OpenDAS / vllm_cscc

01 Feb, 2025 3 commits

[Attention] Deepseek v3 MLA support with FP8 compute (#12601) · baeded25

Lucas Wilkinson authored Feb 01, 2025



This PR implements DeepSeek V3 support by performing matrix absorption on the FP8 weights.
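
As a rough illustration of what "matrix absorption" means in general (a sketch based on the usual MLA formulation, not on this PR's code): when the query is projected by one matrix and the compressed KV latent is up-projected by another, the two matrices can be multiplied together once, ahead of time, so scores are computed directly against the latent. The names `W_q` and `W_uk` and the tiny dimensions below are hypothetical, and plain Python floats stand in for FP8 tensors.

```python
# Illustrative only: shows that two projections can be folded into one matrix.
# W_q, W_uk, and all sizes are made-up stand-ins, not names from the PR.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

# One query token with hidden size 2, one cached latent with latent size 3.
x = [[1.0, 2.0]]                               # (1, hidden)
c = [[0.5, -1.0, 2.0]]                         # (1, latent)
W_q = [[1.0, 0.0], [2.0, 1.0]]                 # (hidden, head_dim)
W_uk = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # (latent, head_dim)

# Naive path: project both sides, then take the dot product.
q = matmul(x, W_q)                             # (1, head_dim)
k = matmul(c, W_uk)                            # (1, head_dim)
score_naive = matmul(q, transpose(k))[0][0]

# Absorbed path: precompute W_q @ W_uk^T once, then score against the latent.
W_absorbed = matmul(W_q, transpose(W_uk))      # (hidden, latent)
score_absorbed = matmul(matmul(x, W_absorbed), transpose(c))[0][0]
```

Both paths produce the same score because matrix multiplication is associative; the absorbed path avoids up-projecting every cached latent at decode time.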

---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

baeded25

Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517) · 3e1c76cf

Rahul Tuli authored Jan 31, 2025

This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
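
The intended behavior can be sketched as follows. This is a simplified illustration, not the code from the PR: `pick_scheme`, its parameters, the exact-name matching, and the `"Unquantized"` fallback label are hypothetical; only the two `CompressedTensors...` scheme names come from the example output in this description.

```python
# Hypothetical helper: a sparse scheme is chosen only for layers that are
# NOT listed in the sparsity config's ignore list.

def pick_scheme(layer_name, sparsity_ignore, quant_scheme=None):
    """Return the scheme name to use for one layer.

    sparsity_ignore: layer names excluded from 2:4 sparsity.
    quant_scheme: scheme for quantized layers, or None if unquantized.
    """
    is_sparse = layer_name not in sparsity_ignore
    if is_sparse:
        return "CompressedTensors24"
    # Ignored (non-sparse) layers fall back to their own scheme instead of
    # being routed to the sparse Cutlass path.
    return quant_scheme or "Unquantized"  # "Unquantized" is a placeholder label

ignore = ["model.layers.6.mlp.down_proj"]
sparse_layer = pick_scheme(
    "model.layers.0.mlp.down_proj", ignore, "CompressedTensorsW8A8Fp8")
dense_layer = pick_scheme(
    "model.layers.6.mlp.down_proj", ignore, "CompressedTensorsW8A8Fp8")
```

Here `sparse_layer` resolves to the Sparse24 scheme, while `dense_layer`, which is on the ignore list, keeps its FP8 scheme.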

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` 
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** This fix assumes that all modules in fused layers such as `QKV_proj` and
`Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

3e1c76cf

Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258

Eldar Kurtic authored Feb 01, 2025

Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
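
One way to picture the problem (a sketch under assumptions, not the PR's implementation): vLLM fuses some projections into a single module, such as `qkv_proj` or `gate_up_proj`, while a recipe lists the original, unfused layer names. A matcher that only compares the fused name against the explicit targets finds nothing. The mapping `FUSED_TO_SHARDS` and the function `find_target` below are hypothetical.

```python
# Hypothetical mapping from a fused module name to the checkpoint names it
# combines; the real mapping depends on the model implementation.
FUSED_TO_SHARDS = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def find_target(layer_name, targets):
    """Return the matching target names for layer_name, or None if none match."""
    if layer_name in targets:
        return [layer_name]
    prefix, _, leaf = layer_name.rpartition(".")
    shards = FUSED_TO_SHARDS.get(leaf)
    if shards is None:
        return None
    unfused = [f"{prefix}.{shard}" for shard in shards]
    # Accept the fused layer only if every one of its shards is targeted.
    return unfused if all(name in targets for name in unfused) else None

targets = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]
fused_match = find_target("model.layers.0.mlp.gate_up_proj", targets)
direct_match = find_target("model.layers.0.mlp.down_proj", targets)
no_match = find_target("model.layers.1.mlp.gate_up_proj", targets)
```

Requiring every shard to be targeted is a design choice of this sketch; it reflects the idea that a fused module can only use one scheme for all of its parts.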

1867c258

31 Jan, 2025 3 commits

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) · eb5741ad

Tyler Michael Smith authored Jan 31, 2025

Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

eb5741ad

[Bugfix] Revert MoE Triton Config Default (#12629) · 145c2ff6

Robert Shaw authored Jan 31, 2025

SUMMARY:
* A previous PR that pulled in block configs
(https://github.com/vllm-project/vllm/pull/11589/files) also changed the defaults for FP8.
* This broke MoE on L4 because there was not enough shared memory (SHM) for the
default configuration.
* This PR reverts the non-block case to the previous default.
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

145c2ff6

[BugFix] Fix Torch.Compile For DeepSeek (#12594) · 325f679f
Robert Shaw authored Jan 31, 2025
```
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
325f679f

30 Jan, 2025 1 commit

[Kernel] Triton Configs for Fp8 Block Quantization (#11589) · 9b0c4bab

Robert Shaw authored Jan 30, 2025


Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

9b0c4bab

29 Jan, 2025 1 commit
- [Kernel] add triton fused moe kernel for gptq/awq (#12185) · 27b78c73
  Jinzhen Lin authored Jan 29, 2025
  
  27b78c73
28 Jan, 2025 1 commit
- Update `pre-commit` hooks (#12475) · 823ab796
  Harry Mellor authored Jan 28, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  823ab796
26 Jan, 2025 1 commit
- [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417) · aa2cd2c4
  Tyler Michael Smith authored Jan 26, 2025
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  aa2cd2c4
25 Jan, 2025 1 commit
- [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408) · bf21481d
  Divakar Verma authored Jan 24, 2025
```
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
```
  bf21481d
23 Jan, 2025 3 commits

[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (#11528) · eb5cb5e5

Dipika Sikka authored Jan 23, 2025

Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

eb5cb5e5

[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906) · e97f802b

Gregory Shtrasberg authored Jan 23, 2025


Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

e97f802b

[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282) · 68c4421b
rasmith authored Jan 22, 2025
```
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
```
68c4421b

21 Jan, 2025 2 commits
- [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230) · 5fe6bf29
  Nicolò Lucchesi authored Jan 21, 2025
```
Signed-off-by: NickLucche <nlucches@redhat.com>
```
  5fe6bf29
- [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777) · d4b62d46
  Gregory Shtrasberg authored Jan 20, 2025
```
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
```
  d4b62d46
19 Jan, 2025 3 commits

[Model] Support for fairseq2 Llama (#11442) · bbe5f9de

Martin Gleize authored Jan 19, 2025


Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>

bbe5f9de

[V1] Add V1 support of Qwen2-VL (#12128) · 81763c58

Roger Wang authored Jan 19, 2025


Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

81763c58

[Misc] Support register quantization method out-of-tree (#11969) · 32eb0da8
yancong authored Jan 19, 2025

32eb0da8

17 Jan, 2025 2 commits
- [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134) · b5b57e30
  Gregory Shtrasberg authored Jan 17, 2025
```
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
```
  b5b57e30
- [CI/Build][CPU][Bugfix] Fix CPU CI (#12150) · d4e61945
  Li, Jiang authored Jan 17, 2025
```
Signed-off-by: jiang1.li <jiang1.li@intel.com>
```
  d4e61945
16 Jan, 2025 4 commits
- Support torchrun and SPMD-style offline inference (#12071) · bf53e0c7
  youkaichao authored Jan 16, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  bf53e0c7
- Various cosmetic/comment fixes (#12089) · 9aa1519f
  Michael Goin authored Jan 16, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  9aa1519f
- [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651) · fa0050db
  Elfie Guo authored Jan 15, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  fa0050db
- Add w8a8-related changes · 083b80ea
  zhuwenwen authored Jan 16, 2025
  
  083b80ea
15 Jan, 2025 3 commits

[Misc][Quark] Upstream Quark format to VLLM (#10765) · de0526f6

kewang-xlnx authored Jan 16, 2025


Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

de0526f6

Fix: cases with empty sparsity config (#12057) · cbe94391
Rahul Tuli authored Jan 15, 2025
```
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
```
cbe94391
[Kernel] Support MulAndSilu (#11624) · 42f5e7c5
Jee Jee Li authored Jan 15, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
42f5e7c5

14 Jan, 2025 1 commit
- [feat] Upload fused_moe_kernel configs for 4 models: deepseek_v2_lite, qwen2_moe, mixtral 8*7B, 8x822B · 09428eec
  zhuwenwen authored Jan 14, 2025
  
  09428eec
13 Jan, 2025 2 commits
- [Bugfix] Fix deepseekv3 gate bias error (#12002) · f35ec461
  Steve Luo authored Jan 14, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  f35ec461
- [Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685) · d14e98d9
  Isotr0py authored Jan 13, 2025
```
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
  d14e98d9
12 Jan, 2025 1 commit
- [Hardware][TPU] workaround fix for MoE on TPU (#11764) · 263a870e
  Avshalom Manevich authored Jan 12, 2025
  
  263a870e
11 Jan, 2025 1 commit

[Bugfix] fused_experts_impl wrong compute type for float32 (#11921) · c32a7c7c

shaochangxu authored Jan 11, 2025


Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>

c32a7c7c

10 Jan, 2025 3 commits
- [Hardware][CPU] Support MOE models on x86 CPU (#11831) · aa1e77a1
  Li, Jiang authored Jan 11, 2025
```
Signed-off-by: jiang1.li <jiang1.li@intel.com>
```
  aa1e77a1
- [platform] support custom torch.compile backend key (#11318) · 20410b2f
  wangxiyuan authored Jan 10, 2025
```
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
```
  20410b2f
- [misc] remove python function call for custom activation op (#11885) · d907be7d
  cennn authored Jan 10, 2025
```
Co-authored-by: youkaichao <youkaichao@gmail.com>
```
  d907be7d
09 Jan, 2025 1 commit

[Misc] Move `print_*_once` from utils to logger (#11298) · d848800e

Cyrus Leung authored Jan 09, 2025

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>

d848800e

08 Jan, 2025 3 commits
- [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (#11698) · 526de822
  rasmith authored Jan 08, 2025
```
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
```
  526de822
- [TPU][Quantization] TPU `W8A8` (#11785) · 56fe4c29
  Robert Shaw authored Jan 08, 2025
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  56fe4c29
- [Bugfix][XPU] fix silu_and_mul (#11823) · 78f4590b
  Yan Ma authored Jan 09, 2025
```
Signed-off-by: yan ma <yan.ma@intel.com>
```
  78f4590b