Commits · 80f63a3966a6dbfd492ccd74004da5929cdba2bb · OpenDAS / vllm_cscc

15 Feb, 2025 1 commit
- [AMD] [Model] DeepSeek tunings (#13199) · ed0de3e4
  rasmith authored Feb 15, 2025
  
  ed0de3e4
14 Feb, 2025 3 commits
- [Core] Reduce TTFT with concurrent partial prefills (#10235) · 3bcb8c75
  Joe Runde authored Feb 14, 2025
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  3bcb8c75
- [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236) · 5e5c8e09
  Michael Goin authored Feb 14, 2025
```
Signed-off-by: mgoin <mgoin64@gmail.com>
```
  5e5c8e09
- [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198) · c1e37bf7
  Tyler Michael Smith authored Feb 13, 2025
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
  c1e37bf7
13 Feb, 2025 2 commits
- Optimize moe_align_block_size for deepseek_v3 (#12850) · 2344192a
  Michael Goin authored Feb 13, 2025
```
Signed-off-by: mgoin <mgoin64@gmail.com>
```
  2344192a
- Allow Unsloth Dynamic 4bit BnB quants to work (#12974) · cb944d58
  Daniel Han authored Feb 12, 2025
  
  cb944d58
12 Feb, 2025 2 commits
- [Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity (#13119) · 09972e71
  Michael Goin authored Feb 12, 2025
  
  09972e71
- [CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per module override/control (#7086) · 36a08630
  Qubitium-ModelCloud authored Feb 13, 2025
  
  36a08630
11 Feb, 2025 1 commit
- Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 (#13023) · 2b25b7d2
  Szymon Ożóg authored Feb 11, 2025
  
  2b25b7d2
08 Feb, 2025 1 commit
- [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi (#12812) · 2880e21e
  Sanju C Sudhakaran authored Feb 08, 2025
```
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
```
  2880e21e
07 Feb, 2025 2 commits
- [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation... · eaa92d44
  TJian authored Feb 08, 2025
```
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501)
```
  eaa92d44
- PR #12718 (#12718) · 538fab93
  Amit Garg authored Feb 07, 2025
  
  538fab93
06 Feb, 2025 6 commits
- [MISC] Check space in the file names in the pre commit checks (#12804) · 741429a4
  Lu Fang authored Feb 06, 2025
```
Signed-off-by: Lu Fang <lufang@fb.com>
```
  741429a4
- Add Bamba Model (#10909) · aff40457
  Yu Chin Fabian Lim authored Feb 07, 2025
```
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
  aff40457
- [V1] LoRA Support (#10957) · 467a96a5
  Varun Sundar Rabindranath authored Feb 06, 2025
```
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  467a96a5
- [Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) · 7ca9934f
  Dipika Sikka authored Feb 06, 2025
  
  7ca9934f
- [Misc][Easy] Remove the space from the file name · 9cdea30b
  Lu Fang authored Feb 05, 2025
  
  9cdea30b
- [Bugfix] Better FP8 supported defaults · 76abd0c8
  Lucas Wilkinson authored Feb 05, 2025
  
  76abd0c8
05 Feb, 2025 5 commits
- [VLM] Qwen2.5-VL · bf3b79ef
  Roger Wang authored Feb 05, 2025
  
  bf3b79ef
- Add: Support for Sparse24Bitmask Compressed Models · 3b2005e1
  Rahul Tuli authored Feb 05, 2025
  
  3b2005e1
- [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634) · 7ff7a638
  Kyle Sayers authored Feb 05, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  7ff7a638
- Refactor `Linear` handling in `TransformersModel` (#12727) · 249824c3
  Harry Mellor authored Feb 05, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  249824c3
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` (#12368) · b3a0d01e
  Aviv Keshet authored Feb 04, 2025
```
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
```
  b3a0d01e
04 Feb, 2025 2 commits
- [AMD][ROCm] Enable DeepSeek model on ROCm (#12662) · c36ac98d
  Hongxia Yang authored Feb 04, 2025
```
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
```
  c36ac98d
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) · 4896d0c2
  Kyle Sayers authored Feb 04, 2025
  
  4896d0c2
03 Feb, 2025 4 commits

[Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) · 4797dad3
kushanam authored Feb 03, 2025

4797dad3

Fix for attention layers to remain unquantized during moe_wn16 quant (#12570) · b9986454

Srikanth Srinivas authored Feb 02, 2025



Fix to AWQ quant loading of the new R1 model

The new optimized MoE kernel for models with a large number of experts,
`moe_wna16`, uses AWQ quantization, which requires the attention layers
to stay in 16-bit.

The current merge has broken this; `get_quant_method` must return None
for attention layers for loading to work correctly again.

---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

b9986454

Properly check if all fused layers are in the list of targets (#12666) · c5932e5d
Eldar Kurtic authored Feb 03, 2025
```
Thanks @kylesayrs for catching this!
```
c5932e5d

[Kernel] port sgl moe_align_block_size kernels (#12574) · 95460fc5

Yang Chen authored Feb 02, 2025

sgl_moe_align_block_size is based on:
https://github.com/sgl-project/sglang/commit/ded9fcd09a43d5e7d5bb31a2bc3e9fc21bf65d2a

moe_align_block_size is based on:
https://github.com/sgl-project/sglang/commit/ba5112ff691d791a9e38c6c71f59324a5fcb49d0

Signed-off-by: Yang Chen <yangche@fb.com>

95460fc5

02 Feb, 2025 2 commits

[Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a

Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.
    
    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that tool
    do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/

Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
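
The header check described in this commit could be sketched roughly as follows. This is a minimal illustration and not the hook added by the PR: the function names, the `max_lines` window, and the exact comment prefix are assumptions.

```python
from typing import Iterable, List

# Assumed form of the header comment; the real hook may match differently.
SPDX_PREFIX = "# SPDX-License-Identifier:"


def has_spdx_header(source: str, max_lines: int = 5) -> bool:
    """True if one of the first few lines is an SPDX license comment."""
    first_lines = source.splitlines()[:max_lines]
    return any(line.strip().startswith(SPDX_PREFIX) for line in first_lines)


def files_missing_header(paths: Iterable[str]) -> List[str]:
    """Return the paths whose contents lack an SPDX header."""
    missing = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            if not has_spdx_header(handle.read()):
                missing.append(path)
    return missing
```

A pre-commit hook built on this sketch would pass the staged Python files to `files_missing_header` and fail when the returned list is non-empty. Whether empty files should be exempt is a policy choice the sketch leaves open.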

e489ad7a

[Bugfix] fix moe_wna16 get_quant_method (#12648) · baaa2b24

Jinzhen Lin authored Feb 02, 2025

Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method,
the GPTQ-based linear method, or the AWQ-based linear method, even when
the target module is an attention layer.

https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92
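
The intended dispatch can be sketched as below. This is a simplified illustration under stated assumptions, not the code from the PR: the layer classes are hypothetical stand-ins, and strings are returned in place of vLLM's real method objects.

```python
from typing import Optional


# Hypothetical stand-ins for layer types; these are not vLLM classes.
class FusedMoELayer: ...
class LinearLayer: ...
class AttentionLayer: ...


def get_quant_method(layer: object, linear_backend: str = "awq") -> Optional[str]:
    """Pick a quantization method name, or None to leave the layer unquantized."""
    if isinstance(layer, FusedMoELayer):
        return "moe_wna16"
    if isinstance(layer, LinearLayer):
        # Linear layers fall back to the GPTQ- or AWQ-based linear method.
        return f"{linear_backend}_linear"
    # Anything else, including attention layers, gets no quantization method.
    return None
```

The point of the sketch is the final branch: a layer that is neither a fused MoE layer nor a linear layer receives `None` instead of a linear method.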

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

baaa2b24

01 Feb, 2025 4 commits

Apply torch.compile to fused_moe/grouped_topk (#12637) · 3194039c
Michael Goin authored Feb 01, 2025

3194039c

[Attention] Deepseek v3 MLA support with FP8 compute (#12601) · baeded25

Lucas Wilkinson authored Feb 01, 2025



This PR implements the Deepseek V3 support by performing matrix absorption on the fp8 weights.
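
As background on the term, the sketch below shows the algebraic identity usually meant by matrix absorption in MLA, assuming keys are produced from a compressed latent vector by an up-projection. Because q · (W_UK c) equals (W_UK^T q) · c, the up-projection can be folded into the query side. The matrix names and sizes are illustrative, small integers are used so the comparison is exact, and none of the FP8 handling from the PR is shown.

```python
from typing import List

Matrix = List[List[int]]
Vector = List[int]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def dot(u: Vector, v: Vector) -> int:
    return sum(a * b for a, b in zip(u, v))


# Illustrative shapes: head dimension 3, latent dimension 2.
W_UK: Matrix = [[1, 2], [0, -1], [3, 1]]  # key up-projection (3 x 2)
q: Vector = [2, 1, -1]                     # query for one head
c_kv: Vector = [4, 5]                      # compressed latent for one token

# Naive path: expand the latent into a full key, then take the dot product.
naive_score = dot(q, matvec(W_UK, c_kv))

# Absorbed path: fold W_UK into the query once, then score against the latent.
q_absorbed = matvec(transpose(W_UK), q)
absorbed_score = dot(q_absorbed, c_kv)
```

Both paths give the same score, but the absorbed path works directly on the compressed latent, so full keys never have to be materialized per token.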

---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

baeded25

Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517) · 3e1c76cf

Rahul Tuli authored Jan 31, 2025

This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
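
A minimal sketch of the selection logic described above, under stated assumptions: `select_scheme` is a hypothetical helper, schemes are represented by their class names as strings, and ignore entries are matched by exact name or dotted prefix, which may differ from the matching rules the real implementation uses.

```python
from typing import Optional, Sequence


def select_scheme(layer_name: str,
                  sparse_ignore: Sequence[str],
                  quant_scheme: Optional[str]) -> str:
    """Return an illustrative scheme label for one layer."""
    # A layer covered by the sparsity ignore list must not get the sparse kernel.
    ignored = any(layer_name == name or layer_name.startswith(name + ".")
                  for name in sparse_ignore)
    if not ignored:
        return "CompressedTensors24"
    # Non-sparse layers use their quantization scheme, or stay unquantized.
    return quant_scheme or "Unquantized"
```

With `sparse_ignore=["model.layers.6"]`, a layer such as `model.layers.0.mlp.down_proj` would still be routed to the sparse scheme, while `model.layers.6.mlp.down_proj` would fall through to its quantization scheme.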

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** This assumes that all modules in a fused layer, such as
`QKV_proj` and `Gate_up_proj`, follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160).

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

3e1c76cf

Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258

Eldar Kurtic authored Feb 01, 2025

Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
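
One way to picture the matching problem: vLLM loads some layers in fused form, while a recipe lists the original, unfused layer names. The sketch below assumes a fused-to-unfused mapping and a hypothetical `is_targeted` helper; it illustrates the idea and is not the code from the PR.

```python
from typing import Dict, List, Sequence

# Assumed mapping from a fused layer to the checkpoint layers it merges.
FUSED_TO_UNFUSED: Dict[str, List[str]] = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def is_targeted(layer_name: str, targets: Sequence[str]) -> bool:
    """True if the layer, or every layer fused into it, is listed in targets."""
    prefix, _, leaf = layer_name.rpartition(".")
    unfused = FUSED_TO_UNFUSED.get(leaf)
    if unfused is None:
        return layer_name in targets
    # A fused layer matches only when all of its parts are targeted.
    names = [f"{prefix}.{part}" if prefix else part for part in unfused]
    return all(name in targets for name in names)
```

Requiring every part to be listed reflects the assumption that all modules in a fused layer share one scheme; a recipe that targets only some of them would need separate handling.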

1867c258

31 Jan, 2025 3 commits

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) · eb5741ad

Tyler Michael Smith authored Jan 31, 2025

Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

eb5741ad

[Bugfix] Revert MoE Triton Config Default (#12629) · 145c2ff6

Robert Shaw authored Jan 31, 2025

SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

145c2ff6

[BugFix] Fix Torch.Compile For DeepSeek (#12594) · 325f679f
Robert Shaw authored Jan 31, 2025
```
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
325f679f

30 Jan, 2025 1 commit

[Kernel] Triton Configs for Fp8 Block Quantization (#11589) · 9b0c4bab

Robert Shaw authored Jan 30, 2025


Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

9b0c4bab

29 Jan, 2025 1 commit
- [Kernel] add triton fused moe kernel for gptq/awq (#12185) · 27b78c73
  Jinzhen Lin authored Jan 29, 2025
  
  27b78c73