Commits · f17a1a8f9665bb237a3dddda7dc93f259e5e81e0 · OpenDAS / vllm_cscc · GitLab

25 May, 2024 3 commits
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) · d5a16977
  Lily Liu authored May 25, 2024
- [Misc] add logging level env var (#5045) · 325c1199
  youkaichao authored May 24, 2024
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024
```
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
24 May, 2024 2 commits
- [Core][Bugfix]: fix prefix caching for blockv2 (#4764) · e64fde4b
  leiwen83 authored May 25, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
- [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) · 91977095
  Robert Shaw authored May 24, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
23 May, 2024 4 commits
- [Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985) · e3470f87
  Elisei Smirnov authored May 24, 2024
```
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
```
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- [Core][1/N] Support send/recv in PyNCCL Groups (#4988) · 5eda2ea0
  Murali Andoorveedu authored May 23, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
22 May, 2024 7 commits
- [Misc] Take user preference in attention selector (#4960) · ee3eea0a
  Cody Yu authored May 22, 2024
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) · a36de682
  Philipp Moritz authored May 22, 2024
- [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) · eb6d3c26
  Nick Hill authored May 22, 2024
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a .kv_scale parameter).
```
- [misc] remove comments that were supposed to be removed (#4977) · c74c913b
  SangBin Cho authored May 22, 2024
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
21 May, 2024 5 commits
- [Bugfix][Kernel] Add head size check for attention backend selection (#4944) · 99eff67b
  Isotr0py authored May 22, 2024
- [Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935) · 14772eeb
  Kante Yin authored May 22, 2024
```
Signed-off-by: kerthcet <kerthcet@gmail.com>
```
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
- [Model] add rope_scaling support for qwen2 (#4930) · d130b573
  HUANG Fei authored May 21, 2024
- [Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) · 65ae8c2c
  Antoni Baum authored May 20, 2024
20 May, 2024 5 commits
- [Core] Sharded State Loader download from HF (#4889) · 1937e298
  Aurick Qiao authored May 20, 2024
- [Bugfix] Fix dummy weight for fp8 (#4916) · f0eecee6
  Mor Zusman authored May 20, 2024
```
Allow the dummy load format for FP8;
torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com>
```
- [Misc]: allow user to specify port in distributed setting (#4914) · 546a97ef
  Wenwei Zhang authored May 21, 2024
- [Model] LLaVA model refactor (#4910) · 6287537a
  Cyrus Leung authored May 20, 2024
- [Kernel] Add flash-attn back (#4907) · b57e6c59
  Woosuk Kwon authored May 19, 2024
19 May, 2024 2 commits
- [Kernel] Add marlin_24 unit tests (#4901) · 27ce8547
  Alexander Matveev authored May 19, 2024
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
18 May, 2024 2 commits
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024

  Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
- [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used (#4658) · c0724fc9
  alexeykondrat authored May 18, 2024
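
The batched rotary embedding idea from the long-context LoRA commit (#4787) above can be illustrated with a minimal pure-Python sketch. It is not the vLLM kernel: the function names, the linear position scaling, and the half-split (NeoX-style) rotation layout are all assumptions made for illustration. The sketch concatenates one cos-sin cache per scaling factor and lets each token pick its cache through a per-token offset, so a single pass can serve tokens that belong to different LoRAs.

```python
import math

# Hypothetical helper names; linear scaling and half-split rotation are assumptions.

def build_cos_sin_cache(rot_dim, max_pos, base, scaling_factor):
    """Return one row [cos..., sin...] per position, with positions divided by the factor."""
    inv_freq = [base ** (-2 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(int(max_pos * scaling_factor)):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return cache

def build_batched_cache(rot_dim, max_pos, base, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    cache, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        cache.extend(build_cos_sin_cache(rot_dim, max_pos, base, factor))
    return cache, offsets

def apply_rotary(vec, cos_sin_row):
    """Rotate the pairs (vec[i], vec[i + half]) by the angles stored in the cache row."""
    half = len(vec) // 2
    cos, sin = cos_sin_row[:half], cos_sin_row[half:]
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second

def batched_rotary(vectors, positions, cache_offsets, cache):
    """Rotate every token in one pass; each token indexes its own cache via its offset."""
    return [
        apply_rotary(vec, cache[pos + off])
        for vec, pos, off in zip(vectors, positions, cache_offsets)
    ]

# Usage: two tokens served together, one per scaling factor.
cache, offsets = build_batched_cache(rot_dim=4, max_pos=8, base=10000.0,
                                     scaling_factors=[1.0, 2.0])
rotated = batched_rotary(
    vectors=[[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 1.0]],
    positions=[0, 3],
    cache_offsets=[offsets[1.0], offsets[2.0]],
    cache=cache,
)
```

Keeping all caches in one flat table means the only per-token extra input is an integer offset, which is what allows requests with different scaling factors to share a batch in this sketch.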

17 May, 2024 4 commits
- Sync huggingface modifications of qwen Moe model (#4774) · 48d5985a
  eigenLiu authored May 18, 2024
- [Bugfix] fix rope error when load models with different dtypes (#4835) · 33e0823d
  Jinzhen Lin authored May 17, 2024
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) · 26148120
  Alexei-V-Ivanov-AMD authored May 16, 2024
- [Frontend] OpenAI API server: Do not add bos token by default when encoding (#4688) · 0150a106
  bofeng huang authored May 17, 2024
16 May, 2024 6 commits
- [Bugfix] Fix FP8 KV cache support (#4869) · 9a31a817
  Woosuk Kwon authored May 16, 2024
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
- [Misc] remove old comments (#4866) · 10fa9eea
  youkaichao authored May 16, 2024
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
- [ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845) · b5853f99
  Hongxia Yang authored May 16, 2024
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```