Commits · f256ebe4df6757d76f1f1642d7e110268a2f8190 · OpenDAS / vllm_cscc

02 Feb, 2025 5 commits

[Hardware][Intel GPU] add XPU bf16 support (#12392) · f256ebe4
Kunshang Ji authored Feb 02, 2025
```
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
```
f256ebe4

[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) · f8ece6e1

Shawn Du authored Feb 02, 2025

As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254, this PR combines `allocate_slots` and `append_slots`.

There should be no functional change, except that decode now also raises an
exception when `num_tokens` is zero (as prefill already does); the unit test
case is updated accordingly.
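
A minimal sketch of what a single allocation entry point could look like, assuming a toy block pool. `ToyKVCacheManager` and its fields are hypothetical and are not the vLLM implementation; only the name `allocate_slots` and the zero-token check come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCacheManager:
    """Hypothetical, simplified stand-in for a KV cache manager."""
    block_size: int
    num_free_blocks: int
    # request_id -> (blocks allocated so far, tokens stored so far)
    _state: dict = field(default_factory=dict)

    def allocate_slots(self, request_id: str, num_tokens: int):
        """Reserve room for num_tokens new tokens, for prefill and decode alike.

        Returns the number of newly allocated blocks, or None when the pool
        cannot satisfy the request.
        """
        if num_tokens == 0:
            # Same check for both phases, as described in the commit message.
            raise ValueError("num_tokens must be greater than 0")
        blocks, tokens = self._state.get(request_id, (0, 0))
        needed = -(-(tokens + num_tokens) // self.block_size)  # ceiling division
        new_blocks = max(0, needed - blocks)
        if new_blocks > self.num_free_blocks:
            return None
        self.num_free_blocks -= new_blocks
        self._state[request_id] = (blocks + new_blocks, tokens + num_tokens)
        return new_blocks
```

A prefill call and a later one-token decode call go through the same method; the decode call usually allocates no new block until the last block fills up.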

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------
Signed-off-by: Shawn Du <shawnd200@outlook.com>

f8ece6e1

[V1][Minor] Avoid frequently creating ConstantList (#12653) · abfcdcdf

Woosuk Kwon authored Feb 01, 2025



A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
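
The idea can be sketched as follows, assuming a read-only wrapper that keeps a reference to the underlying list. `ReadOnlyView` and `ToyRequest` are hypothetical stand-ins, not the actual `ConstantList` or request classes.

```python
from collections.abc import Sequence


class ReadOnlyView(Sequence):
    """Hypothetical stand-in for ConstantList: a read-only wrapper over a list."""

    def __init__(self, data: list):
        self._data = data

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self):
        return len(self._data)


class ToyRequest:
    def __init__(self):
        self._kv_block_hashes: list = []
        # Build the wrapper once. It still reflects later appends because it
        # holds a reference to the same list object.
        self.kv_block_hashes = ReadOnlyView(self._kv_block_hashes)

    def append_kv_block_hash(self, block_hash) -> None:
        self._kv_block_hashes.append(block_hash)
```

Every read of `request.kv_block_hashes` now returns the same wrapper object instead of allocating a new one.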
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

abfcdcdf

[Core] Silence unnecessary deprecation warnings (#12620) · e497f334

Russell Bryant authored Feb 02, 2025



I noticed during testing that I was getting a lot of these deprecation
warnings about `lora_local_path`:

```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
     and will be removed in a future version.
     Please use 'lora_path' instead.
```

The check used for emitting this warning was always True, even when the
parameter was not actually specified. It will always be in
`__struct_fields__`. We should be checking for a non-None value,
instead.
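
A minimal sketch of the difference between the two checks, assuming a plain class that mimics the `__struct_fields__` attribute. `ToyLoRARequest` is hypothetical and only illustrates the logic.

```python
import warnings


class ToyLoRARequest:
    """Hypothetical stand-in that declares its fields like a struct type."""
    __struct_fields__ = ("lora_name", "lora_path", "lora_local_path")

    def __init__(self, lora_name, lora_path="", lora_local_path=None):
        self.lora_name = lora_name
        self.lora_path = lora_path
        self.lora_local_path = lora_local_path
        # Buggy check: the field name is always declared, so this is always True.
        #   if "lora_local_path" in self.__struct_fields__: ...
        # Fixed check: warn only when the caller actually supplied a value.
        if lora_local_path is not None:
            warnings.warn(
                "The 'lora_local_path' attribute is deprecated and will be "
                "removed in a future version. Please use 'lora_path' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            if not self.lora_path:
                self.lora_path = lora_local_path
```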
Signed-off-by: Russell Bryant <rbryant@redhat.com>

e497f334

[Bugfix] fix moe_wna16 get_quant_method (#12648) · baaa2b24

Jinzhen Lin authored Feb 02, 2025

Fix https://github.com/vllm-project/vllm/issues/12647

The `get_quant_method` of `moe_wna16` always returns the MoE method, the
GPTQ-based linear method, or the AWQ-based linear method, even when the
target module is an attention layer.

https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92
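
One way to picture the intended behavior is a dispatch on the layer type that falls through to `None` for anything that is neither a MoE nor a linear layer. The classes and return values below are hypothetical placeholders, not the vLLM classes or the actual fix.

```python
class LinearBase:
    """Placeholder for a linear layer type."""


class FusedMoE:
    """Placeholder for a fused MoE layer type."""


class Attention:
    """Placeholder for an attention layer type."""


def get_quant_method(layer, linear_method="linear", moe_method="moe"):
    """Return a quantization method only for layer types that support one."""
    if isinstance(layer, FusedMoE):
        return moe_method
    if isinstance(layer, LinearBase):
        return linear_method
    # Attention (and any other layer type) gets no weight quantization method.
    return None
```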

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

baaa2b24

01 Feb, 2025 11 commits

doc: fixing minor typo in readme.md (#12643) · b4e5c033

Vicente Herrera authored Feb 01, 2025



Word "evolved" was mistyped
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>


b4e5c033

Apply torch.compile to fused_moe/grouped_topk (#12637) · 3194039c
Michael Goin authored Feb 01, 2025

3194039c

Disable chunked prefill and/or prefix caching when MLA is enabled (#12642) · 4f4d427a

Simon Mo authored Jan 31, 2025

From @mgoin in https://github.com/vllm-project/vllm/pull/12638



I cannot push to that branch, so this is a new PR to unblock the release.

---------
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

4f4d427a

[CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280) · 1e369839

Russell Bryant authored Feb 01, 2025



We have `v1`, `structured-output`, and `speculative-decoding` labels on
GitHub. This adds automation for applying these labels based on the
files touched by a PR.
Signed-off-by: Russell Bryant <rbryant@redhat.com>


1e369839

[Attention] Deepseek v3 MLA support with FP8 compute (#12601) · baeded25

Lucas Wilkinson authored Feb 01, 2025



This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights.

---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

baeded25

Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517) · 3e1c76cf

Rahul Tuli authored Jan 31, 2025

This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
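
The selection rule can be sketched as below, under the simplifying assumption that the ignore list holds exact layer names. `select_scheme` is a hypothetical helper; the scheme names are taken from the example output in this description.

```python
def select_scheme(layer_name, sparsity_ignore, quant_scheme=None):
    """Pick a scheme name for a layer (illustrative only).

    Layers listed in sparsity_ignore are not Sparse24, so they fall back to
    their quantization scheme (or to no scheme at all) instead of Cutlass.
    """
    if layer_name not in sparsity_ignore:
        return "CompressedTensors24"
    return quant_scheme if quant_scheme is not None else "Unquantized"


# Layers 6 and 7 are excluded from sparsity in this made-up configuration.
ignore = [f"model.layers.{i}.mlp.down_proj" for i in range(6, 8)]
sparse_layer = select_scheme(
    "model.layers.0.mlp.down_proj", ignore, "CompressedTensorsW8A8Fp8")
dense_layer = select_scheme(
    "model.layers.6.mlp.down_proj", ignore, "CompressedTensorsW8A8Fp8")
```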

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c4 # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** This assumes that all modules in fused layers such as `QKV_proj` and
`Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

3e1c76cf

[Bugfix/CI] Fixup benchmark_moe.py (#12562) · cfa134d2

Tyler Michael Smith authored Feb 01, 2025

Fixes `is_marlin` not being passed into `get_default_config`

Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`
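
Accepting several spellings of one option is something `argparse` supports directly by passing multiple option strings to `add_argument`. The sketch below is illustrative; the default value and `build_parser` are assumptions and not taken from `benchmark_moe.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # All three spellings store into the same destination, args.tp_size,
    # because argparse derives the name from the first long option string.
    parser.add_argument("--tp-size", "-tp", "--tensor-parallel-size",
                        type=int, default=2)
    return parser


args = build_parser().parse_args(["--tensor-parallel-size", "4"])
```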
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

cfa134d2

[ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599) · 35b7a055
Kevin H. Luu authored Jan 31, 2025

35b7a055

Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258

Eldar Kurtic authored Feb 01, 2025

Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
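
A rough sketch of how a fused layer name could be matched against targets that list the unfused names explicitly. The mapping, the function name, and the matching rule are assumptions made for illustration; they are not the code in this PR.

```python
# Hypothetical mapping from a fused layer to the unfused layers it combines.
FUSED_TO_SHARDS = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def find_matched_target(layer_name, targets):
    """Return a matching target name, or None when the layer is not targeted."""
    if layer_name in targets:
        return layer_name
    prefix, _, leaf = layer_name.rpartition(".")
    shards = FUSED_TO_SHARDS.get(leaf)
    if shards is None:
        return None
    shard_names = [f"{prefix}.{s}" if prefix else s for s in shards]
    matched = [name in targets for name in shard_names]
    if all(matched):
        # Every unfused part is targeted, so the fused layer is targeted too.
        return shard_names[0]
    if any(matched):
        raise ValueError(
            f"Only some parts of fused layer {layer_name} are targeted")
    return None
```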

1867c258

[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161) · cb3e73e4

fade_away authored Feb 01, 2025

FIX issues https://github.com/vllm-project/vllm/issues/9688 and
https://github.com/vllm-project/vllm/issues/11086; see also #12487

---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

cb3e73e4

[V1] Bugfix: Validate Model Input Length (#12600) · b1340f9d

Robert Shaw authored Jan 31, 2025

SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len
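
A minimal sketch of this kind of up-front validation, assuming the check rejects the single offending request with an exception rather than letting it reach the engine. The function name and the exact boundary (`>` rather than `>=`) are assumptions.

```python
def validate_prompt_length(prompt_token_ids, max_model_len):
    """Reject an over-long prompt before it is scheduled (illustrative only)."""
    num_tokens = len(prompt_token_ids)
    if num_tokens > max_model_len:
        raise ValueError(
            f"Prompt length of {num_tokens} is longer than the maximum "
            f"model length of {max_model_len}.")
    return num_tokens
```

Raising at request admission keeps the failure local to one request, so other in-flight requests are unaffected.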

FIX #12567

b1340f9d

31 Jan, 2025 15 commits

[Doc] int4 w4a16 example (#12585) · 44bbca78

Brian Dellabetta authored Jan 31, 2025

Based on a request by @mgoin, with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).

FIX #n/a (no issue created)

@kylesayrs and I have discussed a couple of additional improvements for the
quantization docs. We will revisit them at a later date, possibly including:
- A section on "choosing the correct quantization scheme/compression
technique"
- Additional vision or audio calibration datasets

---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

44bbca78

[Doc] Improve installation signposting (#12575) · 60808bd4

Harry Mellor authored Jan 31, 2025

- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images

---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

60808bd4

[Feature] Fix guided decoding blocking bitmask memcpy (#12563) · fc542144

Ryan Nguyen authored Jan 31, 2025

**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR makes
the operation asynchronous by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
launches the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>

fc542144

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) · eb5741ad

Tyler Michael Smith authored Jan 31, 2025

Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

eb5741ad

[Bugfix] Revert MoE Triton Config Default (#12629) · 145c2ff6

Robert Shaw authored Jan 31, 2025

SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

145c2ff6

[release] Add input step to ask for Release version (#12631) · 415f1947
Kevin H. Luu authored Jan 31, 2025
```
Instead of having to create a new build with release version put in as
env var.
```
415f1947

[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603) · 89003c40

Chen Zhang authored Feb 01, 2025



This PR adds extra keys to the block hash, so that two blocks with the same
tokens get different hash values when the extra_keys of their parent blocks
differ. For example, it generates different hash values for the second block
of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
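
The effect can be reproduced with a small chained-hash sketch. It assumes a block size of 3 and uses SHA-256 over a `repr` of the inputs, which is an illustrative choice; `hash_block` and `hash_request_blocks` are hypothetical helpers, not the functions in vLLM.

```python
import hashlib


def hash_block(parent_hash, token_ids, extra_keys=None):
    """Hash a block from its parent hash, its tokens, and optional extra keys."""
    payload = repr((parent_hash, tuple(token_ids),
                    tuple(extra_keys) if extra_keys else None))
    return hashlib.sha256(payload.encode()).hexdigest()


def hash_request_blocks(token_ids, block_size, mm_positions, mm_hashes):
    """Hash every full block, using overlapping multimodal hashes as extra keys."""
    hashes = []
    parent = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        extra = [
            mm_hash for pos, mm_hash in zip(mm_positions, mm_hashes)
            if pos["offset"] < end and pos["offset"] + pos["length"] > start
        ]
        parent = hash_block(parent, token_ids[start:end], extra)
        hashes.append(parent)
    return hashes


positions = [{"offset": 0, "length": 3}, {"offset": 3, "length": 3}]
hashes1 = hash_request_blocks(list(range(6)), 3, positions, ["hash1", "hash2"])
hashes2 = hash_request_blocks(list(range(6)), 3, positions, ["hash3", "hash2"])
```

The second blocks share the same tokens and the same extra key (`"hash2"`), yet their hashes differ because the first blocks, which act as parents, were hashed with different extra keys.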

---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com>

89003c40

[Docs][V1] Prefix caching design (#12598) · 60bcef00

Cody Yu authored Jan 31, 2025



- Create v1 design document section in docs.
- Add prefix caching design doc.

@WoosukKwon @ywang96

---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

60bcef00

[Git] Automatically sign-off commits (#12595) · 847f8832

Cody Yu authored Jan 31, 2025



It's very annoying when I forget to add `-s` to `git commit` to
sign off, because I then need to `git rebase HEAD~1 --signoff` and `git
push -f` to fix the DCO. This PR adds a hook that signs off commits
automatically when `-s` is missing. The only
change on the user side is that users now have to install 2 hooks, so
instead of just

```
pre-commit install
```

we now need to run

```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```

Note that if users still install only the pre-commit hook, they
won't get any error in `git commit`; the sign-off hook simply won't run.
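
What a `commit-msg` hook of this kind has to do can be sketched as a pure function over the commit message text. `ensure_signoff` is a hypothetical helper written for illustration; it is not the hook added by this PR, and reading the author name and email from git is left out.

```python
def ensure_signoff(message: str, name: str, email: str) -> str:
    """Append a Signed-off-by trailer unless the message already has it."""
    trailer = f"Signed-off-by: {name} <{email}>"
    if any(line.strip() == trailer for line in message.splitlines()):
        return message  # already signed off, leave the message untouched
    body = message.rstrip("\n")
    lines = body.splitlines()
    last = lines[-1] if lines else ""
    # Keep existing trailers in one block; otherwise start a new paragraph.
    in_trailer_block = last.startswith(("Signed-off-by:", "Co-authored-by:"))
    separator = "\n" if in_trailer_block else "\n\n"
    return f"{body}{separator}{trailer}\n"


signed = ensure_signoff("Fix typo\n", "Jane Doe", "jane@example.com")
```

Because the function returns the message unchanged when the trailer is present, running the hook on a commit made with `-s` is a no-op.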

cc @hmellor @youkaichao

---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

847f8832

[BugFix] Fix Torch.Compile For DeepSeek (#12594) · 325f679f
Robert Shaw authored Jan 31, 2025
```
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
325f679f
Add favicon to docs (#12611) · e3f7ff65
Harry Mellor authored Jan 31, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
e3f7ff65
[Bugfix] Gracefully handle huggingface hub http error (#12571) · 7a8987da
Roger Wang authored Jan 31, 2025

7a8987da

[Attention] MLA decode optimizations (#12528) · cabaf4ef

Lucas Wilkinson authored Jan 31, 2025


Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

cabaf4ef

[ROCm][AMD][Model] llama 3.2 support upstreaming (#12421) · a1fc18c0
Aleksandr Malyshev authored Jan 30, 2025
```
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
```
a1fc18c0
[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868) · 9798b2fb
Lucas Wilkinson authored Jan 30, 2025

9798b2fb

30 Jan, 2025 7 commits
- [V1][Log] Add max request concurrency log to V1 (#12569) · 4078052f
  Michael Goin authored Jan 30, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  4078052f
- [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555) · bd2107e3
  Nishidha authored Jan 31, 2025
```
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
```
  bd2107e3
- [Kernel] Triton Configs for Fp8 Block Quantization (#11589) · 9b0c4bab
  Robert Shaw authored Jan 30, 2025
```
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
  9b0c4bab
- [Misc] fix typo: add missing space in lora adapter error message (#12564) · 41bf5612
  Beim authored Jan 31, 2025
```
Signed-off-by: Beim <beim2015@outlook.com>
```
  41bf5612
- Set `?device={device}` when changing tab in installation guides (#12560) · a2769032
  Harry Mellor authored Jan 30, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  a2769032
- [V1][Metrics] Add GPU cache usage % gauge (#12561) · f17f1d46
  Mark McLoughlin authored Jan 30, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
  f17f1d46
- [Misc][MoE] add Deepseek-V3 moe tuning support (#12558) · 1c1bb0bb
  Divakar Verma authored Jan 29, 2025
```
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
```
  1c1bb0bb
29 Jan, 2025 2 commits
- [V1][BugFix] Free encoder cache for aborted requests (#12545) · e0cc5f25
  Woosuk Kwon authored Jan 29, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  e0cc5f25
- Revert "[Build/CI] Fix libcuda.so linkage" (#12552) · 73aa6cfd
  Tyler Michael Smith authored Jan 29, 2025
  
  73aa6cfd