Commits · e489ad7a210f4234db696d1f2749d5f3662fa65b · OpenDAS / vllm_cscc

02 Feb, 2025 2 commits

[Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a

Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.
    
    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that
    tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
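
A minimal sketch of the kind of check the second commit describes, assuming the header is a comment line beginning with `# SPDX-License-Identifier:` near the top of each file. The function names `has_spdx_header` and `find_missing_headers`, and the five-line search window, are hypothetical and chosen for illustration; this is not the script added in the PR.

```python
from pathlib import Path

# Assumed header prefix; the license identifier that follows it is not checked here.
SPDX_PREFIX = "# SPDX-License-Identifier:"


def has_spdx_header(path: Path, max_lines: int = 5) -> bool:
    """Return True if an SPDX header appears in the first few lines of the file."""
    with path.open(encoding="utf-8") as f:
        # zip() stops after max_lines, so large files are not read in full.
        for _, line in zip(range(max_lines), f):
            if line.strip().startswith(SPDX_PREFIX):
                return True
    return False


def find_missing_headers(paths):
    """Return the paths that lack an SPDX header (empty files are skipped)."""
    return [p for p in paths if p.stat().st_size > 0 and not has_spdx_header(p)]
```

A pre-commit hook could call `find_missing_headers` on the staged Python files and exit with a non-zero status when the returned list is not empty.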

e489ad7a

[Bugfix] fix moe_wna16 get_quant_method (#12648) · baaa2b24

Jinzhen Lin authored Feb 02, 2025

Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method,
the GPTQ-based linear method, or the AWQ-based linear method, even when
the target module is an attention layer.

https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92
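
The intent of the fix described above can be sketched as a dispatch on the layer type. The classes below are empty stand-ins and the returned strings are placeholders for quantization method objects; all of them are assumptions for illustration and do not reproduce the actual vLLM code.

```python
# Hypothetical stand-ins for the real layer classes.
class LinearBase:
    pass


class FusedMoE:
    pass


class Attention:
    pass


def get_quant_method(layer, linear_backend="gptq"):
    """Return a quantization method only for layer types that this config handles."""
    if isinstance(layer, FusedMoE):
        return "moe_wna16_method"
    if isinstance(layer, LinearBase):
        # Fall back to the GPTQ- or AWQ-based linear method.
        return f"{linear_backend}_linear_method"
    # Attention and any other layer type: nothing to apply.
    return None
```

With this shape, a caller that passes an attention layer gets `None` back rather than a linear or MoE method it cannot use.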

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

baaa2b24

01 Feb, 2025 3 commits

[Attention] Deepseek v3 MLA support with FP8 compute (#12601) · baeded25

Lucas Wilkinson authored Feb 01, 2025



This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights.

---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

baeded25

Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517) · 3e1c76cf

Rahul Tuli authored Jan 31, 2025

This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
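
A minimal sketch of the selection logic the changes above describe, under stated assumptions: the `ignore` list holds either exact layer names or patterns prefixed with `re:`, and the two scheme names are taken from the example output in this description. The helpers `is_ignored` and `select_scheme` are hypothetical names and not the functions changed in this PR.

```python
import re


def is_ignored(layer_name, ignore):
    """Return True if layer_name matches an entry of the ignore list."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Assumed convention: the text after "re:" is a regular expression.
            if re.match(pattern[3:], layer_name):
                return True
        elif pattern == layer_name:
            return True
    return False


def select_scheme(layer_name, sparsity_ignore,
                  sparse_scheme="CompressedTensors24",
                  dense_scheme="CompressedTensorsW8A8Fp8"):
    """Pick the sparse scheme only for layers that are not in the ignore list."""
    if is_ignored(layer_name, sparsity_ignore):
        return dense_scheme
    return sparse_scheme
```

For example, an ignore list of `["re:model\\.layers\\.([6-9]|\\d{2,})\\..*"]` would keep layers 0 to 5 on the sparse scheme and route layers 6 and above to the dense one.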

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** This fix assumes that all modules in fused layers such as
`QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160).

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

3e1c76cf

Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258

Eldar Kurtic authored Feb 01, 2025

Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
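
One way to picture target matching for fused layers is to expand a fused module name into the names of its shards before comparing against the recipe's `targets`. The sketch below assumes a mapping from `qkv_proj` to `q_proj`/`k_proj`/`v_proj` and from `gate_up_proj` to `gate_proj`/`up_proj`; the mapping and the helper names are hypothetical and are not taken from this PR.

```python
# Assumed mapping from fused module names to their unfused shard names.
FUSED_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def unfused_names(layer_name):
    """Expand a fused layer name into its shard names; other names pass through."""
    prefix, _, leaf = layer_name.rpartition(".")
    shards = FUSED_MAPPING.get(leaf)
    if shards is None:
        return [layer_name]
    return [f"{prefix}.{shard}" if prefix else shard for shard in shards]


def matches_targets(layer_name, targets):
    """A (possibly fused) layer matches only if every one of its shards is targeted."""
    return all(name in targets for name in unfused_names(layer_name))
```

Requiring every shard to be listed reflects the assumption that all parts of a fused layer share one quantization scheme; a recipe that targets only some shards would need separate handling.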

1867c258

31 Jan, 2025 2 commits
- [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) · eb5741ad
  Tyler Michael Smith authored Jan 31, 2025
```
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
  eb5741ad
- [BugFix] Fix Torch.Compile For DeepSeek (#12594) · 325f679f
  Robert Shaw authored Jan 31, 2025
```
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
  325f679f
30 Jan, 2025 1 commit

[Kernel] Triton Configs for Fp8 Block Quantization (#11589) · 9b0c4bab

Robert Shaw authored Jan 30, 2025


Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

9b0c4bab

29 Jan, 2025 1 commit
- [Kernel] add triton fused moe kernel for gptq/awq (#12185) · 27b78c73
  Jinzhen Lin authored Jan 29, 2025
  
  27b78c73
28 Jan, 2025 1 commit
- Update `pre-commit` hooks (#12475) · 823ab796
  Harry Mellor authored Jan 28, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  823ab796
26 Jan, 2025 1 commit
- [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417) · aa2cd2c4
  Tyler Michael Smith authored Jan 26, 2025
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  aa2cd2c4
23 Jan, 2025 3 commits

[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE... · eb5cb5e5

Dipika Sikka authored Jan 23, 2025


[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order  (#11528)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

eb5cb5e5

[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906) · e97f802b

Gregory Shtrasberg authored Jan 23, 2025


Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

e97f802b

[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282) · 68c4421b
rasmith authored Jan 22, 2025
```
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
```
68c4421b

19 Jan, 2025 1 commit
- [Misc] Support register quantization method out-of-tree (#11969) · 32eb0da8
  yancong authored Jan 19, 2025
  
  32eb0da8
17 Jan, 2025 1 commit
- [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134) · b5b57e30
  Gregory Shtrasberg authored Jan 17, 2025
```
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
```
  b5b57e30
16 Jan, 2025 2 commits
- Various cosmetic/comment fixes (#12089) · 9aa1519f
  Michael Goin authored Jan 16, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  9aa1519f
- [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651) · fa0050db
  Elfie Guo authored Jan 15, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  fa0050db
15 Jan, 2025 2 commits

[Misc][Quark] Upstream Quark format to VLLM (#10765) · de0526f6

kewang-xlnx authored Jan 16, 2025


Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

de0526f6

Fix: cases with empty sparsity config (#12057) · cbe94391
Rahul Tuli authored Jan 15, 2025
```
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
```
cbe94391

09 Jan, 2025 1 commit

[Misc] Move `print_*_once` from utils to logger (#11298) · d848800e

Cyrus Leung authored Jan 09, 2025

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>

d848800e

08 Jan, 2025 2 commits
- [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (#11698) · 526de822
  rasmith authored Jan 08, 2025
```
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
```
  526de822
- [TPU][Quantization] TPU `W8A8` (#11785) · 56fe4c29
  Robert Shaw authored Jan 08, 2025
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  56fe4c29
30 Dec, 2024 1 commit
- [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618) · 5dbf8545
  Li, Jiang authored Dec 30, 2024
```
Signed-off-by: jiang1.li <jiang1.li@intel.com>
```
  5dbf8545
27 Dec, 2024 3 commits
- [Bugfix] Fix for ROCM compressed tensor support (#11561) · ac797994
  Selali authored Dec 27, 2024
  
  ac797994
- [BugFix] Fix quantization for all other methods (#11547) · 2339d59f
  Robert Shaw authored Dec 27, 2024
  
  2339d59f
- Deepseek v3 (#11502) · f49777ba
  Simon Mo authored Dec 26, 2024
```
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com>
```
  f49777ba
26 Dec, 2024 1 commit

[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523) · 2072924d

Michael Goin authored Dec 26, 2024


Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: HandH1998 <1335248067@qq.com>

2072924d

23 Dec, 2024 1 commit
- [Misc] Add assertion and helpful message for marlin24 compressed models (#11388) · b866cdbd
  Dipika Sikka authored Dec 23, 2024
  
  b866cdbd
21 Dec, 2024 1 commit
- [Bugfix] update should_ignore_layer (#11354) · 51ff216d
  George authored Dec 21, 2024
```
Signed-off-by: George Ohashi <george@neuralmagic.com>
```
  51ff216d
19 Dec, 2024 2 commits
- [Bugfix] Fix broken CPU compressed-tensors test (#11338) · 276738ce
  Isotr0py authored Dec 20, 2024
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
  276738ce
- [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311) · 5a9da2e6
  Tyler Michael Smith authored Dec 18, 2024
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
  5a9da2e6
18 Dec, 2024 1 commit

[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995) · 60508ffd

Dipika Sikka authored Dec 18, 2024


Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

60508ffd

15 Dec, 2024 1 commit
- [Misc] Upgrade bitsandbytes to the latest version 0.45.0 (#11201) · 15859f23
  Jee Jee Li authored Dec 15, 2024
  
  15859f23
27 Nov, 2024 2 commits
- [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675) · b98c62ba
  Isotr0py authored Nov 28, 2024
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
  b98c62ba
- [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig (#10657) · cfb3bf25
  yansh97 authored Nov 27, 2024
  
  cfb3bf25
26 Nov, 2024 1 commit
- [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642) · 7576cd38
  Michael Goin authored Nov 26, 2024
  
  7576cd38
21 Nov, 2024 1 commit
- [torch.compile] limit inductor threads and lazy import quant (#10482) · 388ee3de
  youkaichao authored Nov 20, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  388ee3de
19 Nov, 2024 1 commit
- [Model][Quantization] HQQ support through Marlin kernel expansion (#9766) · b00b33d7
  ElizaWszola authored Nov 19, 2024
```
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
```
  b00b33d7
18 Nov, 2024 1 commit
- [Kernel] Initial Machete W4A8 support + Refactors (#9855) · 96d999fb
  Lucas Wilkinson authored Nov 18, 2024
```
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
```
  96d999fb