Commits · 64862d106efa78032702f5fa5c110ccd6d654e9a · OpenDAS / vllm_cscc

05 Feb, 2025 4 commits
- [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713) · 64862d10
  Aleksandr Malyshev authored Feb 04, 2025
```
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
```
  64862d10
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` (#12368) · b3a0d01e
  Aviv Keshet authored Feb 04, 2025
```
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
```
  b3a0d01e
- [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676) · 75e94309
  Lucas Wilkinson authored Feb 04, 2025
```
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
  75e94309
- [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) · 233df6f5
  Mark McLoughlin authored Feb 05, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
  233df6f5
04 Feb, 2025 13 commits
- [Bugfix] Fix CI failures for InternVL and Mantis models (#12728) · 18016a5e
  Cyrus Leung authored Feb 04, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  18016a5e
- [Build] update requirements of no-device for plugin usage (#12630) · 649550f2
  Sophie du Couédic authored Feb 04, 2025
```
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
```
  649550f2
- Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722) · 62467a83
  Kero Liang authored Feb 04, 2025
```
Signed-off-by: imkero <kerorek@outlook.com>
```
  62467a83
- [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689) · 6469038b
  Michael Greenbaum authored Feb 04, 2025
```
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com>
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>
```
  6469038b
- [VLM] merged multimodal processor and V1 support for idefics3 (#12660) · 815079de
  Isotr0py authored Feb 04, 2025
```
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
  815079de
- [V1] Remove scheduling constraint on partial requests (#12674) · 18a88fcc
  Woosuk Kwon authored Feb 04, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  18a88fcc
- [VLM] Merged multi-modal processor for InternVL-based models (#12553) · d1ca7df8
  Cyrus Leung authored Feb 04, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
```
  d1ca7df8
- [Misc] Add BNB quantization for Whisper (#12381) · 96b23621
  Jee Jee Li authored Feb 04, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
  96b23621
- [AMD][ROCm] Enable DeepSeek model on ROCm (#12662) · c36ac98d
  Hongxia Yang authored Feb 04, 2025
```
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
```
  c36ac98d
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) · 4896d0c2
  Kyle Sayers authored Feb 04, 2025
  
  4896d0c2
- [Doc] Replace ibm-fms with ibm-ai-platform (#12709) · bb392af4
  Thomas Parnell authored Feb 04, 2025
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  bb392af4
- Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710) · 5d98d560
  Michael Goin authored Feb 03, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  5d98d560
- [Core] Improve hash collision avoidance in prefix caching (#12621) · 73b35cca
  Russell Bryant authored Feb 03, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
  73b35cca
03 Feb, 2025 15 commits

[V1] Revert `uncache_blocks` and support recaching full blocks (#12415) · 5095e966
Cody Yu authored Feb 03, 2025
```
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
```
5095e966
[MISC] Remove model input dumping when exception (#12582) · cf58b9c4
Cody Yu authored Feb 03, 2025
```
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
```
cf58b9c4
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) · 4797dad3
kushanam authored Feb 03, 2025

4797dad3
Squelch MLA warning for Compressed-Tensors Models (#12704) · 6dd5e528
Kyle Sayers authored Feb 03, 2025
```
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
```
6dd5e528
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696) · c11de33d
Tyler Michael Smith authored Feb 03, 2025
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
c11de33d
[Misc] Fix improper placement of SPDX header in scripts (#12694) · 33e0602e
Russell Bryant authored Feb 03, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
33e0602e

[Model]: Add `transformers` backend support (#11330) · a1a2aaad

Arthur authored Feb 03, 2025

# Adds support for `transformers` as a backend

Following https://github.com/huggingface/transformers/pull/35235, a
bunch of models should already be supported; we are ramping up support
for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the hub that
implements attention the correct way can be natively supported
- tensor parallel support

---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

a1a2aaad

[ci/build] fix gh200 test (#12681) · 1298a400
youkaichao authored Feb 03, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
1298a400

[cuda] manually import the correct pynvml module (#12679) · ad4a9dc8

youkaichao authored Feb 03, 2025

fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565



---------
Signed-off-by: youkaichao <youkaichao@gmail.com>

ad4a9dc8

Fix for attention layers to remain unquantized during moe_wn16 quant (#12570) · b9986454

Srikanth Srinivas authored Feb 02, 2025



Fix to AWQ quant loading of the new R1 model.

The new optimized MoE kernel for models with a large number of experts, `moe_wna16`,
uses AWQ quantization, which requires the attention layers to stay in 16-bit.

The current merge broke this; `get_quant_method` must
return None for attention layers for it to work correctly again.

---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

b9986454

Properly check if all fused layers are in the list of targets (#12666) · c5932e5d
Eldar Kurtic authored Feb 03, 2025
```
Thanks @kylesayrs for catching this!
```
c5932e5d

make sure mistral_common not imported for non-mistral models (#12669) · 20579c0f

youkaichao authored Feb 03, 2025

When people use deepseek models, they find that they need to resolve a cv2
version conflict; see https://zhuanlan.zhihu.com/p/21064432691

I added the check and made all imports of `cv2` lazy.

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>
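
The lazy-import idea described above can be sketched roughly as follows. This is only an illustration, not the code from the commit: `get_image_helper`, `HEAVY_MODULE`, and the `model_family` argument are hypothetical names, and a standard-library module stands in for `cv2` so the sketch runs anywhere.

```python
import importlib

# Stand-in for a heavy optional dependency (the commit concerns `cv2`);
# a stdlib module is used here so the sketch has no extra requirements.
HEAVY_MODULE = "colorsys"


def get_image_helper(model_family: str):
    """Import the heavy module only for the model family that needs it.

    `model_family` is a hypothetical argument; the real check in vLLM
    may look at different configuration.
    """
    if model_family != "mistral":
        # Non-mistral models never trigger the import at all.
        return None
    # Deferred import: runs on first use instead of at module load time.
    return importlib.import_module(HEAVY_MODULE)
```

With this shape, a user running a non-mistral model never pays for (or trips over) the optional dependency.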

20579c0f

[Kernel] port sgl moe_align_block_size kernels (#12574) · 95460fc5

Yang Chen authored Feb 02, 2025

sgl_moe_align_block_size is based on:
https://github.com/sgl-project/sglang/commit/ded9fcd09a43d5e7d5bb31a2bc3e9fc21bf65d2a

moe_align_block_size is based on:
https://github.com/sgl-project/sglang/commit/ba5112ff691d791a9e38c6c71f59324a5fcb49d0

Signed-off-by: Yang Chen <yangche@fb.com>

95460fc5

[Doc] Deprecate Discord (#12668) · 326fcc8b
Zhuohan Li authored Feb 02, 2025

326fcc8b

[doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667) · e6433091

youkaichao authored Feb 03, 2025

As more and more people try deepseek models with multi-node
inference, https://github.com/vllm-project/vllm/issues/7815 is coming up more
frequently. Let's give users a clear message.
Signed-off-by: youkaichao <youkaichao@gmail.com>

e6433091

02 Feb, 2025 6 commits

[Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a

Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files

    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that
    tool do its job.

    More information can be found on the SPDX site:

    - https://spdx.dev/learn/handling-license-info/

Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
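
A header check of the kind mentioned above could look roughly like this. It is a minimal sketch and not the hook added by the commit: `has_spdx_header`, `missing_headers`, and the five-line search window are assumptions made for illustration.

```python
from typing import Dict, List

# SPDX identifiers are written as a comment of this form near the top of a file.
SPDX_PREFIX = "# SPDX-License-Identifier:"


def has_spdx_header(source: str, max_lines: int = 5) -> bool:
    """Return True if an SPDX identifier comment is in the first few lines.

    The small window (an assumption) tolerates a shebang or encoding line
    placed before the header.
    """
    head = source.splitlines()[:max_lines]
    return any(line.strip().startswith(SPDX_PREFIX) for line in head)


def missing_headers(files: Dict[str, str]) -> List[str]:
    """Given a mapping of file name to file text, list files lacking a header."""
    return sorted(name for name, text in files.items()
                  if not has_spdx_header(text))
```

A pre-commit hook would typically call something like `missing_headers` on the staged files and exit non-zero if the list is non-empty.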

e489ad7a

[Hardware][Intel GPU] add XPU bf16 support (#12392) · f256ebe4
Kunshang Ji authored Feb 02, 2025
```
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
```
f256ebe4

[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) · f8ece6e1

Shawn Du authored Feb 02, 2025

As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254,
this PR combines allocate_slots and append_slots.

There should be no functional change, except that decode now also
raises an exception when num_tokens is zero (as prefill does); the
unit test case is changed accordingly.

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------
Signed-off-by: Shawn Du <shawnd200@outlook.com>
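
To illustrate the unification described above, here is a toy sketch in which one `allocate_slots` method serves both the first (prefill-style) call and later (decode-style) calls. `ToyKVCacheManager` and its fields are hypothetical and far simpler than the real KV cache manager; only the method name and the zero-token check come from the commit message.

```python
from typing import Dict, List, Optional


class ToyKVCacheManager:
    """Toy block allocator; not the real vLLM KV cache manager."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.req_blocks: Dict[str, List[int]] = {}
        self.req_tokens: Dict[str, int] = {}

    def allocate_slots(self, request_id: str,
                       num_tokens: int) -> Optional[List[int]]:
        """Reserve blocks for `num_tokens` more tokens of a request.

        Returns the newly allocated block ids, or None if there is not
        enough free space. The same path handles new and running requests.
        """
        if num_tokens == 0:
            # Same rule for prefill and decode after the unification.
            raise ValueError("num_tokens must be greater than 0")
        held = self.req_blocks.get(request_id, [])
        total = self.req_tokens.get(request_id, 0) + num_tokens
        # Ceiling division gives the blocks needed for all tokens so far.
        needed = -(-total // self.block_size) - len(held)
        if needed > len(self.free_blocks):
            return None
        new_blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.req_blocks[request_id] = held + new_blocks
        self.req_tokens[request_id] = total
        return new_blocks
```

A decode step that still fits in the last partially filled block simply gets an empty list back, so callers do not need a separate append path.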

f8ece6e1

[V1][Minor] Avoid frequently creating ConstantList (#12653) · abfcdcdf

Woosuk Kwon authored Feb 01, 2025



A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
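
The optimization can be pictured with the sketch below, which builds the read-only wrapper once and reuses it. `ReadOnlyView` and `ToyRequest` are hypothetical stand-ins for `ConstantList` and the real request class, so treat this as an assumption about the general shape rather than the actual change.

```python
from collections.abc import Sequence
from typing import List


class ReadOnlyView(Sequence):
    """Read-only window onto a list (stand-in for `ConstantList`)."""

    def __init__(self, data: list) -> None:
        self._data = data

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self) -> int:
        return len(self._data)


class ToyRequest:
    def __init__(self) -> None:
        self._kv_block_hashes: List[int] = []
        # Built once here, instead of wrapping the list on every access.
        # The view shares the underlying list, so it stays up to date.
        self.kv_block_hashes = ReadOnlyView(self._kv_block_hashes)

    def append_kv_block_hash(self, block_hash: int) -> None:
        self._kv_block_hashes.append(block_hash)
```

Because the view references the same list object, appends made through the request are visible without creating a new wrapper.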

abfcdcdf

[Core] Silence unnecessary deprecation warnings (#12620) · e497f334

Russell Bryant authored Feb 02, 2025



I noticed during testing that I was getting a lot of these deprecation
warnings about `lora_local_path`:

```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
     and will be removed in a future version.
     Please use 'lora_path' instead.
```

The check used for emitting this warning was always True, even when the
parameter was not actually specified, because the field name is always
present in `__struct_fields__`. We should be checking for a non-None
value instead.
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
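
The difference between the two checks can be sketched as below. `ToyLoRARequest` is a hypothetical plain class that imitates the `__struct_fields__` attribute of a struct type; it is not the real request class, and the warning text is abbreviated.

```python
import warnings
from typing import Optional


class ToyLoRARequest:
    # Struct-style classes list every declared field name here,
    # whether or not the caller passed a value for it.
    __struct_fields__ = ("lora_name", "lora_path", "lora_local_path")

    def __init__(self, lora_name: str, lora_path: str = "",
                 lora_local_path: Optional[str] = None) -> None:
        self.lora_name = lora_name
        self.lora_path = lora_path
        self.lora_local_path = lora_local_path

        # A membership test such as
        #   "lora_local_path" in self.__struct_fields__
        # is True for every instance, so it would warn unconditionally.
        # Checking the value warns only when the old parameter is used.
        if self.lora_local_path is not None:
            warnings.warn(
                "The 'lora_local_path' attribute is deprecated. "
                "Please use 'lora_path' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
```

Constructing `ToyLoRARequest("a", lora_path="/p")` stays silent, while passing `lora_local_path` emits a single `DeprecationWarning`.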

e497f334

[Bugfix] fix moe_wna16 get_quant_method (#12648) · baaa2b24

Jinzhen Lin authored Feb 02, 2025

Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, the
GPTQ-based linear method, or the AWQ-based linear method, even when the
target module is an attention layer.

https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
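
The intended dispatch can be sketched as follows. All class names and the string return values here are hypothetical placeholders; the real `get_quant_method` returns quantization method objects and inspects real vLLM layer types.

```python
from typing import Optional


class Attention:
    """Placeholder for an attention layer."""


class LinearLayer:
    """Placeholder for a linear layer."""


class FusedMoELayer:
    """Placeholder for a fused MoE layer."""


class ToyMoeWNA16Config:
    def get_quant_method(self, layer: object, prefix: str) -> Optional[str]:
        """Pick a quantization method by layer type.

        Strings stand in for method objects in this sketch.
        """
        if isinstance(layer, FusedMoELayer):
            return "moe_wna16"
        if isinstance(layer, LinearLayer):
            # Stand-in for the GPTQ- or AWQ-based linear method.
            return "linear_fallback"
        # Attention (and any other layer type) stays unquantized.
        return None
```

Returning `None` for anything that is not a MoE or linear layer is what keeps the attention layers in their original precision.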

baaa2b24

01 Feb, 2025 2 commits
- doc: fixing minor typo in readme.md (#12643) · b4e5c033
  Vicente Herrera authored Feb 01, 2025
```
Word "evolved" was mistyped
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>

---------
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
```
  b4e5c033
- Apply torch.compile to fused_moe/grouped_topk (#12637) · 3194039c
  Michael Goin authored Feb 01, 2025
  
  3194039c