Commits · f775a07e30fdeafc14f53fe502b262b00540dd71 · OpenDAS / vllm_cscc

03 Jun, 2024 3 commits
- [FRONTEND] OpenAI `tools` support named functions (#5032) · f775a07e
  Breno Faria authored Jun 04, 2024
  
  f775a07e
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
02 Jun, 2024 1 commit
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP
```
  a66cf40b
01 Jun, 2024 3 commits
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) · c3540728
  Ye Cao authored Jun 02, 2024
```
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
```
  c3540728
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
31 May, 2024 2 commits
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
- [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) · b35be540
  Robert Shaw authored May 30, 2024
  
  b35be540
30 May, 2024 1 commit
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) · 5bf185a1
  Alexander Matveev authored May 29, 2024
  
  5bf185a1
28 May, 2024 1 commit
- [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951) · dd8de11f
  Divakar Verma authored May 28, 2024
```
This PR adds Triton kernel configs for the MoE kernel for MI300X
```
  dd8de11f
27 May, 2024 3 commits
- [Model] Add support for falcon-11B (#5069) · 890aa93d
  Isotr0py authored May 28, 2024
  
  890aa93d
- [Core] Allow AQLM on Pascal (#5058) · fbdb7b3e
  sasha0552 authored May 27, 2024
  
  fbdb7b3e
- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024
```
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  1102bef2
25 May, 2024 1 commit

- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024

  Co-authored-by: beagleski <yunanzhang@microsoft.com>
  Co-authored-by: bapatra <bapatra@microsoft.com>
  Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>

  8e192ff9

24 May, 2024 1 commit
- [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) · 91977095
  Robert Shaw authored May 24, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  91977095
23 May, 2024 3 commits
- [Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985) · e3470f87
  Elisei Smirnov authored May 24, 2024
```
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
```
  e3470f87
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 3 commits
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) · a36de682
  Philipp Moritz authored May 22, 2024
  
  a36de682
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
```
  a3a73ab0
21 May, 2024 2 commits
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
  
  f12c3b5b
- [Model] add rope_scaling support for qwen2 (#4930) · d130b573
  HUANG Fei authored May 21, 2024
  
  d130b573
20 May, 2024 3 commits
- [Core] Sharded State Loader download from HF (#4889) · 1937e298
  Aurick Qiao authored May 20, 2024
  
  1937e298
- [Bugfix] Fix dummy weight for fp8 (#4916) · f0eecee6
  Mor Zusman authored May 20, 2024
```
Allow the dummy load format for FP8;
torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com>
```
  f0eecee6
- [Model] LLaVA model refactor (#4910) · 6287537a
  Cyrus Leung authored May 20, 2024
  
  6287537a
19 May, 2024 2 commits
- [Kernel] Add marlin_24 unit tests (#4901) · 27ce8547
  Alexander Matveev authored May 19, 2024
  
  27ce8547
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
  f68470e8
18 May, 2024 1 commit

- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024

  Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

  2e9a2227
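
The batched lookup described in this commit can be pictured with a small pure-Python sketch. It is not the kernel from the PR: it assumes linear RoPE scaling and a NeoX-style rotation, and every name in it (`build_batched_cache`, `batched_rotary`, `offsets`) is hypothetical. The idea is that the caches for all scaling factors are concatenated into one table, and each token indexes it through an offset chosen by its own scaling factor, so one call can serve tokens from several LoRAs.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, scaling_factor, base=10000.0):
    """Return one (cos, sin) row per position, assuming linear RoPE scaling."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    rows = []
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor  # scaled position
        rows.append(([math.cos(t * f) for f in inv_freq],
                     [math.sin(t * f) for f in inv_freq]))
    return rows


def build_batched_cache(max_pos, rot_dim, scaling_factors):
    """Concatenate the per-factor caches and remember where each one starts."""
    cache, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        cache.extend(build_cos_sin_cache(max_pos, rot_dim, factor))
    return cache, offsets


def rotate(vec, cos, sin):
    """Rotate a vector whose first and second halves are paired (NeoX style)."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def batched_rotary(vectors, positions, factors, cache, offsets):
    """Apply rotary embedding to a batch whose tokens use different factors."""
    out = []
    for vec, pos, factor in zip(vectors, positions, factors):
        cos, sin = cache[offsets[factor] + pos]  # single table, per-token offset
        out.append(rotate(vec, cos, sin))
    return out


# Two tokens from two hypothetical LoRAs with different scaling factors.
cache, offsets = build_batched_cache(max_pos=4, rot_dim=4,
                                     scaling_factors=[1.0, 2.0])
rotated = batched_rotary([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]],
                         positions=[1, 2], factors=[1.0, 2.0],
                         cache=cache, offsets=offsets)
```

In this sketch position 2 under factor 2.0 maps to the same scaled position as position 1 under factor 1.0, so the two rotated vectors match.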

17 May, 2024 2 commits
- Sync huggingface modifications of qwen Moe model (#4774) · 48d5985a
  eigenLiu authored May 18, 2024
  
  48d5985a
- [Bugfix] fix rope error when load models with different dtypes (#4835) · 33e0823d
  Jinzhen Lin authored May 17, 2024
  
  33e0823d
16 May, 2024 4 commits
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
  5c342570
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 1 commit

- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1
  SangBin Cho authored May 15, 2024

  [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

  This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing it when running the attn backend.

  It also refactors subquery_start_loc, which was not refactored in the previous PR.

  65bf2ac1
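
A rough way to picture the combined metadata is the sketch below. It is an illustration only, not the code from the PR: the class `BatchAttnMetadata`, its fields, and the function `prepare_model_input` are hypothetical, and the sketch assumes prefill sequences are laid out before decode sequences in the batch. One preparation call builds one metadata object, and the attention backend takes a prefill slice or a decode slice from it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BatchAttnMetadata:
    """Hypothetical combined metadata: prefill sequences first, then decodes."""
    num_prefills: int
    seq_lens: List[int]    # total length of each sequence, prefills first
    query_lens: List[int]  # tokens processed this step (1 per decode)

    @property
    def num_prefill_tokens(self) -> int:
        return sum(self.query_lens[:self.num_prefills])

    @property
    def num_decode_tokens(self) -> int:
        return sum(self.query_lens[self.num_prefills:])

    def prefill_slice(self) -> Optional["BatchAttnMetadata"]:
        """Metadata restricted to the prefill sequences, or None if absent."""
        n = self.num_prefills
        if n == 0:
            return None
        return BatchAttnMetadata(n, self.seq_lens[:n], self.query_lens[:n])

    def decode_slice(self) -> Optional["BatchAttnMetadata"]:
        """Metadata restricted to the decode sequences, or None if absent."""
        n = self.num_prefills
        if n == len(self.seq_lens):
            return None
        return BatchAttnMetadata(0, self.seq_lens[n:], self.query_lens[n:])


def prepare_model_input(prefill_prompts: List[List[int]],
                        decode_context_lens: List[int]) -> BatchAttnMetadata:
    """Single entry point standing in for separate prompt/decode preparation."""
    prompt_lens = [len(p) for p in prefill_prompts]
    seq_lens = prompt_lens + [c + 1 for c in decode_context_lens]
    query_lens = prompt_lens + [1] * len(decode_context_lens)
    return BatchAttnMetadata(len(prefill_prompts), seq_lens, query_lens)


# Two prompts being prefilled plus two sequences in the decode phase.
meta = prepare_model_input([[1, 2, 3], [4, 5]], [7, 9])
prefill_part = meta.prefill_slice()
decode_part = meta.decode_slice()
```

Keeping both phases in one object means a mixed batch needs only one preparation pass, while each backend path still sees only the sequences relevant to it.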

13 May, 2024 3 commits
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update... · 8bc68e19
  Sanger Steel authored May 13, 2024
```
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
```
  8bc68e19
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc