Commits · 022bcc701a948f96e68af678eee686837f393d07 · OpenDAS / vllm_cscc

05 Feb, 2025 13 commits
- [Bugfix] Fix 'ModuleNotFoundError: No module named... · 022bcc70
  Akash kaothalkar authored Feb 05, 2025
```
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546)
```
  022bcc70
- [Doc] Remove performance warning for auto_awq.md (#12743) · c53dc466
  Michael Goin authored Feb 05, 2025
  
  c53dc466
- [V1][Misc] Shorten `FinishReason` enum and use constant strings (#12760) · 3d09e592
  Nick Hill authored Feb 04, 2025
  
  3d09e592
- [Bugfix] Fix OpenVINO model runner (#12750) · fcf2e3d7
  Harry Mellor authored Feb 05, 2025
  
  fcf2e3d7
- [Doc] Update PR Reminder with link to Developer Slack (#12748) · 58b218d7
  Michael Goin authored Feb 05, 2025
  
  58b218d7
- [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634) · 7ff7a638
  Kyle Sayers authored Feb 05, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  7ff7a638
- [Misc] Bump the compressed-tensors version (#12736) · 686006a2
  Dipika Sikka authored Feb 04, 2025
  
  686006a2
- [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) · 98fd089f
  Isotr0py authored Feb 05, 2025
  
  98fd089f
- Refactor `Linear` handling in `TransformersModel` (#12727) · 249824c3
  Harry Mellor authored Feb 05, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  249824c3
- [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713) · 64862d10
  Aleksandr Malyshev authored Feb 04, 2025
```
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
```
  64862d10
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` (#12368) · b3a0d01e
  Aviv Keshet authored Feb 04, 2025
```
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
```
  b3a0d01e
- [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676) · 75e94309
  Lucas Wilkinson authored Feb 04, 2025
```
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
```
  75e94309
- [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) · 233df6f5
  Mark McLoughlin authored Feb 05, 2025
```
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
```
  233df6f5
04 Feb, 2025 13 commits
- [Bugfix] Fix CI failures for InternVL and Mantis models (#12728) · 18016a5e
  Cyrus Leung authored Feb 04, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  18016a5e
- [Build] update requirements of no-device for plugin usage (#12630) · 649550f2
  Sophie du Couédic authored Feb 04, 2025
```
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
```
  649550f2
- Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722) · 62467a83
  Kero Liang authored Feb 04, 2025
```
Signed-off-by: imkero <kerorek@outlook.com>
```
  62467a83
- [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689) · 6469038b
  Michael Greenbaum authored Feb 04, 2025
```
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com>
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>
```
  6469038b
- [VLM] merged multimodal processor and V1 support for idefics3 (#12660) · 815079de
  Isotr0py authored Feb 04, 2025
```
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
  815079de
- [V1] Remove scheduling constraint on partial requests (#12674) · 18a88fcc
  Woosuk Kwon authored Feb 04, 2025
```
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  18a88fcc
- [VLM] Merged multi-modal processor for InternVL-based models (#12553) · d1ca7df8
  Cyrus Leung authored Feb 04, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
```
  d1ca7df8
- [Misc] Add BNB quantization for Whisper (#12381) · 96b23621
  Jee Jee Li authored Feb 04, 2025
```
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```
  96b23621
- [AMD][ROCm] Enable DeepSeek model on ROCm (#12662) · c36ac98d
  Hongxia Yang authored Feb 04, 2025
```
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
```
  c36ac98d
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) · 4896d0c2
  Kyle Sayers authored Feb 04, 2025
  
  4896d0c2
- [Doc] Replace ibm-fms with ibm-ai-platform (#12709) · bb392af4
  Thomas Parnell authored Feb 04, 2025
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  bb392af4
- Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710) · 5d98d560
  Michael Goin authored Feb 03, 2025
```
Signed-off-by: mgoin <michael@neuralmagic.com>
```
  5d98d560
- [Core] Improve hash collision avoidance in prefix caching (#12621) · 73b35cca
  Russell Bryant authored Feb 03, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
  73b35cca
03 Feb, 2025 14 commits

[V1] Revert `uncache_blocks` and support recaching full blocks (#12415) · 5095e966
Cody Yu authored Feb 03, 2025
```
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
```
5095e966
[MISC] Remove model input dumping when exception (#12582) · cf58b9c4
Cody Yu authored Feb 03, 2025
```
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
```
cf58b9c4
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) · 4797dad3
kushanam authored Feb 03, 2025

4797dad3
Squelch MLA warning for Compressed-Tensors Models (#12704) · 6dd5e528
Kyle Sayers authored Feb 03, 2025
```
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
```
6dd5e528
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696) · c11de33d
Tyler Michael Smith authored Feb 03, 2025
```
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
c11de33d
[Misc] Fix improper placement of SPDX header in scripts (#12694) · 33e0602e
Russell Bryant authored Feb 03, 2025
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
33e0602e

[Model]: Add `transformers` backend support (#11330) · a1a2aaad

Arthur authored Feb 03, 2025

# Adds support for `transformers` as a backend

Following https://github.com/huggingface/transformers/pull/35235, a
bunch of models should already be supported, and we are ramping up support
for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the hub can be natively
  supported if it implements attention the correct way.
- tensor parallel support

---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

a1a2aaad

[ci/build] fix gh200 test (#12681) · 1298a400
youkaichao authored Feb 03, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
1298a400

[cuda] manually import the correct pynvml module (#12679) · ad4a9dc8

youkaichao authored Feb 03, 2025

Fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>

ad4a9dc8

Fix for attention layers to remain unquantized during moe_wn16 quant (#12570) · b9986454

Srikanth Srinivas authored Feb 02, 2025



Fix AWQ quant loading of the new R1 model.

The new optimized MoE kernels for a large number of experts, `moe_wn16`,
use AWQ quant, which requires the attention layers to be in 16-bit.

The current merge has broken this; `get_quant_method` must
return None for the attention layers for it to work correctly again.
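
As a rough illustration of the behaviour described above, here is a minimal sketch of a quantization config whose `get_quant_method` returns None for attention layers so they stay in 16-bit. The `Layer` dataclass, the `SketchQuantConfig` class, and the string return values are hypothetical stand-ins and not vLLM's real classes or signatures; only the "return None for attention layers" rule comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    """Hypothetical stand-in for a model layer (not a vLLM type)."""
    kind: str    # assumed labels: "attention", "moe", or "linear"
    prefix: str  # dotted module path, used here only for readability


class SketchQuantConfig:
    """Hypothetical config; illustrates the dispatch rule only."""

    def get_quant_method(self, layer: Layer) -> Optional[str]:
        # Returning None signals "leave this layer unquantized",
        # which keeps the attention weights in 16-bit.
        if layer.kind == "attention":
            return None
        # Assumption: expert layers go to the MoE kernel path.
        if layer.kind == "moe":
            return "moe_wn16"
        # Assumption: any other layer uses plain AWQ.
        return "awq"


cfg = SketchQuantConfig()
attn = Layer(kind="attention", prefix="model.layers.0.self_attn")
experts = Layer(kind="moe", prefix="model.layers.0.mlp.experts")
print(cfg.get_quant_method(attn), cfg.get_quant_method(experts))
```

The point of the sketch is only the ordering of the checks: the attention case is handled first, so no later branch can assign it a quantized method.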

---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

b9986454

Properly check if all fused layers are in the list of targets (#12666) · c5932e5d
Eldar Kurtic authored Feb 03, 2025
```
Thanks @kylesayrs for catching this!
```
c5932e5d

make sure mistral_common not imported for non-mistral models (#12669) · 20579c0f

youkaichao authored Feb 03, 2025

When people use deepseek models, they find that they need to resolve a cv2
version conflict; see https://zhuanlan.zhihu.com/p/21064432691.

I added the check and made all imports of `cv2` lazy.
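
A lazy import of the kind mentioned above can be sketched with a small proxy that defers the real import until an attribute is first used. This is a generic pattern under assumed names (`LazyModule` is hypothetical), not the code from this commit.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Hypothetical proxy that imports the named module on first use."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        # Import once and cache the module object.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    def __getattr__(self, attr: str) -> Any:
        # Called only for attributes not found on the proxy itself,
        # so the real import happens here, at first use.
        return getattr(self._load(), attr)


# Creating the proxy does not import cv2, so code paths that never
# touch it (for example, non-Mistral models) never trigger the import.
cv2 = LazyModule("cv2")

# Demonstration with a standard-library module instead of cv2.
lazy_math = LazyModule("math")
print(lazy_math.sqrt(9.0))
```

With this pattern, a missing or conflicting `cv2` install only surfaces when a code path actually uses it.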

---------
Signed-off-by: youkaichao <youkaichao@gmail.com>

20579c0f

[Kernel] port sgl moe_align_block_size kernels (#12574) · 95460fc5

Yang Chen authored Feb 02, 2025

sgl_moe_align_block_size is based on:
https://github.com/sgl-project/sglang/commit/ded9fcd09a43d5e7d5bb31a2bc3e9fc21bf65d2a

moe_align_block_size is based on:
https://github.com/sgl-project/sglang/commit/ba5112ff691d791a9e38c6c71f59324a5fcb49d0

Signed-off-by: Yang Chen <yangche@fb.com>

95460fc5

[Doc] Deprecate Discord (#12668) · 326fcc8b
Zhuohan Li authored Feb 02, 2025

326fcc8b