Commits · ef9baee3c52f719df64a646db72b6c4ede8a29a0 · OpenDAS / vllm_cscc

28 Aug, 2024 2 commits
- [Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948) · ef9baee3
  Cyrus Leung authored Aug 28, 2024
- [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) · fab5f53e
  Peter Salas authored Aug 27, 2024
21 Aug, 2024 1 commit
- [Model] Add UltravoxModel and UltravoxConfig (#7615) · 1ca0d4f8
  Peter Salas authored Aug 21, 2024
19 Aug, 2024 1 commit
- [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) · ff7ec82c
  SangBin Cho authored Aug 18, 2024
14 Aug, 2024 1 commit
- [VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) · 3f674a49
  Cyrus Leung authored Aug 15, 2024
13 Aug, 2024 2 commits
- [Frontend][Core] Add plumbing to support audio language models (#7446) · 00c3d68e
  Peter Salas authored Aug 13, 2024
- [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) · 7025b11d
  Cyrus Leung authored Aug 13, 2024
01 Aug, 2024 1 commit
- [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings (#6758) · 630dd9e0
  Travis Johnson authored Jul 31, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
```
24 Jul, 2024 1 commit
- [Bugfix] Fix token padding for chameleon (#6724) · 0a740a11
  Roger Wang authored Jul 24, 2024
23 Jul, 2024 2 commits
- Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon (#6690) · 1bedf210
  Roger Wang authored Jul 23, 2024
- [VLM][Model] Support image input for Chameleon (#6633) · 22fa2e35
  Roger Wang authored Jul 22, 2024
22 Jul, 2024 1 commit
- [Model] Initial Support for Chameleon (#5770) · c9eef37f
  Roger Wang authored Jul 21, 2024
19 Jul, 2024 1 commit
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515) · dbe55885
  Robert Shaw authored Jul 18, 2024
18 Jul, 2024 1 commit
- [Model] Support Mistral-Nemo (#6548) · 15c6a079
  Michael Goin authored Jul 18, 2024
17 Jul, 2024 1 commit
- [Distributed][PP] only create embedding & lm head when necessary (#6455) · 1d094fd7
  Wushi Dong authored Jul 16, 2024
```
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
```
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
15 Jul, 2024 1 commit
- [core][distributed] simplify code to support pipeline parallel (#6406) · 69672f11
  youkaichao authored Jul 14, 2024
02 Jul, 2024 2 commits
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
27 Jun, 2024 2 commits
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) · 98cf2ed6
  Cyrus Leung authored Jun 28, 2024
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
27 May, 2024 1 commit
- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024

  Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
23 May, 2024 1 commit
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024

  Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
22 May, 2024 2 commits
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) · a36de682
  Philipp Moritz authored May 22, 2024
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
```
18 May, 2024 1 commit
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024

  Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This change adds a batched rotary embedding kernel and pipes it through.

  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
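The idea in the #4787 message, one rotary embedding pass that serves tokens with different scaling factors, can be sketched roughly as below. This is a minimal pure-Python illustration, not the code from the commit: the function names (`build_cos_sin_cache`, `build_batched_cache`, `apply_rotary_batched`), the linear position scaling, and the NeoX-style pairing of elements are all assumptions made for the example.

```python
import math


def build_cos_sin_cache(rot_dim, max_pos, base=10000.0, scaling_factor=1.0):
    """Return one row per position: [cos_0..cos_{h-1}, sin_0..sin_{h-1}], h = rot_dim // 2."""
    half = rot_dim // 2
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(half)]
    cache = []
    for pos in range(max_pos):
        # Assumed linear scaling: positions are divided by the scaling factor.
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return cache


def build_batched_cache(rot_dim, base_max_pos, scaling_factors):
    """Concatenate the caches of all scaling factors and record where each one starts."""
    cache, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        max_pos = int(base_max_pos * factor)
        cache.extend(build_cos_sin_cache(rot_dim, max_pos, scaling_factor=factor))
    return cache, offsets


def apply_rotary_batched(vectors, positions, token_offsets, cache):
    """Rotate every token in one pass; each token reads row (offset + position)."""
    out = []
    for x, pos, off in zip(vectors, positions, token_offsets):
        row = cache[off + pos]
        half = len(x) // 2
        cos, sin = row[:half], row[half:]
        x1, x2 = x[:half], x[half:]
        # Assumed NeoX-style layout: element i is paired with element i + half.
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(first + second)
    return out


# Two tokens in the same batch, each using a different scaling factor.
cache, offsets = build_batched_cache(rot_dim=4, base_max_pos=8, scaling_factors=[1.0, 2.0])
rotated = apply_rotary_batched(
    vectors=[[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]],
    positions=[2, 4],
    token_offsets=[offsets[1.0], offsets[2.0]],
    cache=cache,
)
```

Because every token carries its own offset into the shared cache, tokens that belong to different scaling factors can be rotated in the same call instead of one call per factor.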

13 May, 2024 1 commit
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
25 Apr, 2024 1 commit
- [Model] Adds Phi-3 support (#4298) · 96e90fde
  Caio Mendes authored Apr 25, 2024
16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
10 Apr, 2024 1 commit
- [Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f
  youkaichao authored Apr 10, 2024

  [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
08 Apr, 2024 1 commit
- [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration (#3767) · bc0c0192
  Kiran R authored Apr 09, 2024
```
Co-authored-by: roy <jasonailu87@gmail.com>
```
03 Apr, 2024 1 commit
- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5
  Adrian Abeyta authored Apr 03, 2024

  Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
  Co-authored-by: HaiShaw <hixiao@gmail.com>
  Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
  Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
  Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
  Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
  Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
  Co-authored-by: guofangze <guofangze@kuaishou.com>
  Co-authored-by: Michael Goin <mgoin64@gmail.com>
  Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
25 Mar, 2024 3 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
20 Mar, 2024 1 commit
- Migrate `logits` computation and gather to `model_runner` (#3233) · f1c0fc39
  Roy authored Mar 21, 2024
07 Mar, 2024 1 commit
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024