Commits · 1867c258bda3bc6adb07090c508fd85e3ceed547 · OpenDAS / vllm_cscc

01 Feb, 2025 3 commits

Fix target matching for fused layers with compressed-tensors (#12617) · 1867c258

Eldar Kurtic authored Feb 01, 2025

Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
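
As a rough illustration of why explicitly listed layer names can fail to match: vLLM merges some projections into a single module (for example `qkv_proj` in place of `q_proj`, `k_proj` and `v_proj`), so the fused name never appears verbatim in a recipe's `targets`. The sketch below is not the code from this PR. `FUSED_SHARDS` and `matches_target` are hypothetical names, and the rule "a fused layer matches only if all of its shards are targeted" is an assumption made for illustration.

```python
# Hypothetical mapping from a fused module suffix to the unfused names
# that a recipe would list. Assumed for illustration only.
FUSED_SHARDS = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def matches_target(layer_name: str, targets: list[str]) -> bool:
    """Return True if layer_name, or every shard of a fused layer, is targeted."""
    if layer_name in targets:
        return True
    prefix, _, suffix = layer_name.rpartition(".")
    shards = FUSED_SHARDS.get(suffix)
    if shards is None:
        return False
    # Assumption: a fused layer matches only when all unfused shards are listed.
    return all(f"{prefix}.{shard}" in targets for shard in shards)


targets = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]
print(matches_target("model.layers.0.mlp.gate_up_proj", targets))
print(matches_target("model.layers.0.self_attn.qkv_proj", targets))
```

The first call matches through the fused mapping, while the second does not because none of the attention projections are listed.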

1867c258

[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161) · cb3e73e4

fade_away authored Feb 01, 2025

FIX issue https://github.com/vllm-project/vllm/issues/9688
https://github.com/vllm-project/vllm/issues/11086

 #12487

---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

cb3e73e4

[V1] Bugfix: Validate Model Input Length (#12600) · b1340f9d

Robert Shaw authored Jan 31, 2025

SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len

FIX #12567
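
The idea of rejecting an over-long input up front, instead of letting it crash the engine, can be sketched as below. This is a minimal illustration and not the code from this PR: `validate_prompt_length` is a hypothetical helper, and treating a prompt of exactly `max_model_len` tokens as valid is an assumption.

```python
def validate_prompt_length(prompt_token_ids: list[int], max_model_len: int) -> None:
    """Reject an over-long prompt with an error the caller can handle."""
    num_tokens = len(prompt_token_ids)
    # Assumption: only prompts strictly longer than max_model_len are rejected.
    if num_tokens > max_model_len:
        raise ValueError(
            f"Prompt has {num_tokens} tokens, which exceeds "
            f"max_model_len={max_model_len}.")


try:
    validate_prompt_length(list(range(10)), max_model_len=8)
except ValueError as err:
    # The request fails, but the process serving other requests keeps running.
    print(f"request rejected: {err}")
```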

b1340f9d

31 Jan, 2025 15 commits

[Doc] int4 w4a16 example (#12585) · 44bbca78

Brian Dellabetta authored Jan 31, 2025

Based on a request by @mgoin, with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).

FIX #n/a (no issue created)

@kylesayrs and I have discussed a couple of additional improvements for the
quantization docs. We will revisit them at a later date, possibly including:
- A section on "choosing the correct quantization scheme/compression
technique"
- Additional vision or audio calibration datasets

---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

44bbca78

[Doc] Improve installation signposting (#12575) · 60808bd4

Harry Mellor authored Jan 31, 2025

- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images

---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

60808bd4

[Feature] Fix guided decoding blocking bitmask memcpy (#12563) · fc542144

Ryan Nguyen authored Jan 31, 2025

**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR makes
the copy asynchronous by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
launches the sampling kernels after the bitmask has been applied. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>

fc542144

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) · eb5741ad

Tyler Michael Smith authored Jan 31, 2025

Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

eb5741ad

[Bugfix] Revert MoE Triton Config Default (#12629) · 145c2ff6

Robert Shaw authored Jan 31, 2025

SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

145c2ff6

[release] Add input step to ask for Release version (#12631) · 415f1947
Kevin H. Luu authored Jan 31, 2025
```
Instead of having to create a new build with release version put in as
env var.
```
415f1947

[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603) · 89003c40

Chen Zhang authored Feb 01, 2025



This PR adds extra keys to the block hash so that two blocks with the
same token IDs but different extra_keys in their parent blocks get
different hash values. For example, it generates different hash values
for the second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=list(range(6)),
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=list(range(6)),
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
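
The chaining described above can be sketched as follows. This is a simplified illustration, not vLLM's implementation: `hash_block` and `hash_request` are hypothetical helpers, SHA-256 over a `repr` is an arbitrary choice, and the block size of 3 with one multimodal hash per block is an assumption chosen to mirror the two requests.

```python
import hashlib
from typing import Optional


def hash_block(parent_hash: Optional[str], token_ids: tuple,
               extra_keys: tuple = ()) -> str:
    """Hash a block from its parent's hash, its tokens, and its extra keys."""
    payload = repr((parent_hash, token_ids, extra_keys)).encode()
    return hashlib.sha256(payload).hexdigest()


def hash_request(token_ids: list, block_size: int,
                 extra_keys_per_block: list) -> list:
    """Return the chained hash of every block in a request."""
    hashes = []
    parent = None
    for i, extra_keys in enumerate(extra_keys_per_block):
        block = tuple(token_ids[i * block_size:(i + 1) * block_size])
        # The parent hash feeds into the child, so a difference in an
        # earlier block's extra keys propagates to all later blocks.
        parent = hash_block(parent, block, extra_keys)
        hashes.append(parent)
    return hashes


tokens = list(range(6))
hashes1 = hash_request(tokens, 3, [("hash1",), ("hash2",)])
hashes2 = hash_request(tokens, 3, [("hash3",), ("hash2",)])
print(hashes1[1] != hashes2[1])
```

The second blocks share both their tokens and their own extra key, yet their hashes differ because the first blocks differ.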

---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com>

89003c40

[Docs][V1] Prefix caching design (#12598) · 60bcef00

Cody Yu authored Jan 31, 2025



- Create v1 design document section in docs.
- Add prefix caching design doc.

@WoosukKwon @ywang96

---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

60bcef00

[Git] Automatically sign-off commits (#12595) · 847f8832

Cody Yu authored Jan 31, 2025



It's very annoying when I forget to add `-s` in `git commit` to
sign off, because I then need to `git rebase HEAD~1 --signoff` and `git
push -f` to fix the DCO. This PR adds a hook that signs off commits
automatically when `-s` is missing. The only
change on the user side is that users now have to install 2 hooks, so
instead of just

```
pre-commit install
```

we now need to run

```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```

Note that users who still install only the pre-commit hook won't get
any error in `git commit`; the sign-off hook just won't run.
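
For illustration, the core of a sign-off step can be sketched like this. It is not the hook added by this PR: `ensure_signoff` is a hypothetical helper, and the name and email are placeholder values that a real hook would read from the Git configuration.

```python
def ensure_signoff(message: str, name: str, email: str) -> str:
    """Append a Signed-off-by trailer unless the message already has it."""
    trailer = f"Signed-off-by: {name} <{email}>"
    if trailer in message.splitlines():
        return message
    # Simplification: always start a new paragraph for the trailer.
    return message.rstrip("\n") + "\n\n" + trailer + "\n"


# Placeholder identity, assumed for the example.
signed = ensure_signoff("Fix typo in docs\n", "Jane Doe", "jane@example.com")
print(signed)
```

Git passes the path of the commit message file to a `commit-msg` hook as its first argument, so a hook built around this function would read that file, apply the function, and write the result back.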

cc @hmellor @youkaichao

---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

847f8832

[BugFix] Fix Torch.Compile For DeepSeek (#12594) · 325f679f
Robert Shaw authored Jan 31, 2025
```
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
325f679f
Add favicon to docs (#12611) · e3f7ff65
Harry Mellor authored Jan 31, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
e3f7ff65
[Bugfix] Gracefully handle huggingface hub http error (#12571) · 7a8987da
Roger Wang authored Jan 31, 2025

7a8987da

[Attention] MLA decode optimizations (#12528) · cabaf4ef

Lucas Wilkinson authored Jan 31, 2025


Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

cabaf4ef

[ROCm][AMD][Model] llama 3.2 support upstreaming (#12421) · a1fc18c0
Aleksandr Malyshev authored Jan 30, 2025
```
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
```
a1fc18c0
[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868) · 9798b2fb
Lucas Wilkinson authored Jan 30, 2025

9798b2fb

30 Jan, 2025 7 commits
- [V1][Log] Add max request concurrency log to V1 (#12569) · 4078052f
  Michael Goin authored Jan 30, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  4078052f
- [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555) · bd2107e3
  Nishidha authored Jan 31, 2025
```
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
```
  bd2107e3
- [Kernel] Triton Configs for Fp8 Block Quantization (#11589) · 9b0c4bab
  Robert Shaw authored Jan 30, 2025
```
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
  9b0c4bab
- [Misc] fix typo: add missing space in lora adapter error message (#12564) · 41bf5612
  Beim authored Jan 31, 2025
```
Signed-off-by: Beim <beim2015@outlook.com>
```
  41bf5612
- Set `?device={device}` when changing tab in installation guides (#12560) · a2769032
  Harry Mellor authored Jan 30, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  a2769032
- [V1][Metrics] Add GPU cache usage % gauge (#12561) · f17f1d46
  Mark McLoughlin authored Jan 30, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
  f17f1d46
- [Misc][MoE] add Deepseek-V3 moe tuning support (#12558) · 1c1bb0bb
  Divakar Verma authored Jan 29, 2025
```
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
```
  1c1bb0bb
29 Jan, 2025 14 commits

[V1][BugFix] Free encoder cache for aborted requests (#12545) · e0cc5f25
Woosuk Kwon authored Jan 29, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
e0cc5f25
Revert "[Build/CI] Fix libcuda.so linkage" (#12552) · 73aa6cfd
Tyler Michael Smith authored Jan 29, 2025

73aa6cfd
[Kernel] add triton fused moe kernel for gptq/awq (#12185) · 27b78c73
Jinzhen Lin authored Jan 29, 2025

27b78c73
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787) · b02fd288
Pavani Majety authored Jan 29, 2025
```
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
b02fd288
[Frontend] Support override generation config in args (#12409) · ff7424f4
Yanyi Liu authored Jan 29, 2025
```
Signed-off-by: liuyanyi <wolfsonliu@163.com>
```
ff7424f4

[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069) · d93bf4da

Alphi authored Jan 29, 2025


Signed-off-by: hzh <hezhihui_thu@163.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

d93bf4da

[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347) · 036ca94c

Travis Johnson authored Jan 29, 2025


Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Wallas Santos <wallashss@ibm.com>

036ca94c

Fix the pydantic logging validator (#12420) · ef001d98
Maximilien de Bayser authored Jan 29, 2025
```
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
```
ef001d98
[V1] Improve Error Message for Unsupported Config (#12535) · 5f671cb4
Robert Shaw authored Jan 28, 2025
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
5f671cb4
Bugfix for whisper quantization due to fake k_proj bias (#12524) · bd02164c
Michael Goin authored Jan 28, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
bd02164c
[V1][Metrics] Add TTFT and TPOT histograms (#12530) · 46fb0567
Mark McLoughlin authored Jan 29, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
46fb0567
[Doc] Convert docs to use colon fences (#12471) · dd6a3a02
Harry Mellor authored Jan 29, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
dd6a3a02

[Frontend] Support reasoning content for deepseek r1 (#12473) · a7e3eba6

Ce Gao authored Jan 29, 2025


Signed-off-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>

a7e3eba6

[TPU] Add example for profiling TPU inference (#12531) · fbb5bd4c
Michael Goin authored Jan 28, 2025
```
Signed-off-by: mgoin <mgoin@redhat.com>
```
fbb5bd4c

28 Jan, 2025 1 commit
- [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482) · 80fcc3ed
  fenghuizhang authored Jan 28, 2025
```
Signed-off-by: Fenghui Zhang <fhzhang@google.com>
```
  80fcc3ed