Commits · c35407282878cb3a42860d584a4d9eb6aed82299 · OpenDAS / vllm_cscc


01 Jun, 2024 2 commits
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) · c3540728
  Ye Cao authored Jun 02, 2024
```
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
```
  c3540728
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
31 May, 2024 2 commits
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
- [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) · b35be540
  Robert Shaw authored May 30, 2024
  
  b35be540
30 May, 2024 1 commit
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) · 5bf185a1
  Alexander Matveev authored May 29, 2024
  
  5bf185a1
28 May, 2024 1 commit
- [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951) · dd8de11f
  Divakar Verma authored May 28, 2024
```
This PR adds Triton kernel configs for the MoE kernel for MI300X
```
  dd8de11f
27 May, 2024 3 commits
- [Model] Add support for falcon-11B (#5069) · 890aa93d
  Isotr0py authored May 28, 2024
  
  890aa93d
- [Core] Allow AQLM on Pascal (#5058) · fbdb7b3e
  sasha0552 authored May 27, 2024
  
  fbdb7b3e
- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024
```
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  1102bef2
25 May, 2024 1 commit

[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9

Eric Xihui Lin authored May 25, 2024


Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

8e192ff9

24 May, 2024 1 commit
- [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) · 91977095
  Robert Shaw authored May 24, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  91977095
23 May, 2024 3 commits
- [Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985) · e3470f87
  Elisei Smirnov authored May 24, 2024
```
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
```
  e3470f87
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 3 commits
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) · a36de682
  Philipp Moritz authored May 22, 2024
  
  a36de682
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with .kv_scale parameter).
```
  a3a73ab0
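The message of #4893 above mentions reading scaling factors from checkpoint entries that carry a `.kv_scale` parameter. A minimal sketch of that idea follows. The helper name `collect_kv_scales`, the `layers.<index>` naming pattern, and the default scale of 1.0 are assumptions made for illustration; they are not taken from the PR.

```python
import re

# Hypothetical naming scheme: "<prefix>.layers.<index>.<module>.kv_scale".
_KV_SCALE_NAME = re.compile(r"layers\.(\d+)\..*\.kv_scale$")


def collect_kv_scales(state_dict, num_layers, default=1.0):
    """Return one KV-cache scale per layer, read from '*.kv_scale' entries.

    Layers without an entry keep `default` (an assumption for this sketch).
    """
    scales = [default] * num_layers
    for name, value in state_dict.items():
        match = _KV_SCALE_NAME.search(name)
        if match is None:
            continue  # not a kv_scale entry, e.g. an ordinary weight
        layer = int(match.group(1))
        if not 0 <= layer < num_layers:
            raise ValueError(f"kv_scale for unknown layer {layer}: {name}")
        scales[layer] = float(value)
    return scales


# Toy checkpoint: two layers carry a scale, one does not.
checkpoint = {
    "model.layers.0.self_attn.kv_scale": 0.5,
    "model.layers.0.self_attn.q_proj.weight": [0.0, 0.0],
    "model.layers.2.self_attn.kv_scale": 0.25,
}
scales = collect_kv_scales(checkpoint, num_layers=3)
```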
21 May, 2024 2 commits
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
  
  f12c3b5b
- [Model] add rope_scaling support for qwen2 (#4930) · d130b573
  HUANG Fei authored May 21, 2024
  
  d130b573
20 May, 2024 3 commits
- [Core] Sharded State Loader download from HF (#4889) · 1937e298
  Aurick Qiao authored May 20, 2024
  
  1937e298
- [Bugfix] Fix dummy weight for fp8 (#4916) · f0eecee6
  Mor Zusman authored May 20, 2024
```
Allow dummy load format for fp8,
torch.uniform_ doesn't support FP8 at the moment
Co-authored-by: Mor Zusman <morz@ai21.com>
```
  f0eecee6
- [Model] LLaVA model refactor (#4910) · 6287537a
  Cyrus Leung authored May 20, 2024
  
  6287537a
19 May, 2024 2 commits
- [Kernel] Add marlin_24 unit tests (#4901) · 27ce8547
  Alexander Matveev authored May 19, 2024
  
  27ce8547
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
  f68470e8
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files

2e9a2227
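The message of #4787 above describes a rotary embedding layer that knows about several cos-sin caches, one per scaling factor, so that one batched call can serve tokens from different LoRAs. The sketch below illustrates that layout in plain Python. It assumes linear position scaling and a concatenated lookup table with per-factor offsets; the function names are hypothetical and the PR's actual kernel may be organised differently.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, scaling_factor, base=10000.0):
    """Rows of [cos..., sin...] for positions 0..max_pos-1 (linear scaling assumed)."""
    inv_freq = [base ** (-2 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(max_pos):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return cache


def build_batched_cache(base_max_pos, rot_dim, scaling_factors):
    """Concatenate one cache per scaling factor and record where each one starts."""
    table, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(table)
        # A larger factor covers a proportionally longer context.
        table.extend(build_cos_sin_cache(int(base_max_pos * factor), rot_dim, factor))
    return table, offsets


def lookup(table, offsets, positions, factors):
    """One pass over the batch: each token picks its row via its own offset."""
    return [table[offsets[f] + p] for p, f in zip(positions, factors)]


table, offsets = build_batched_cache(base_max_pos=4, rot_dim=4, scaling_factors=[1, 4])
# Token 0 uses the unscaled cache, token 1 belongs to a LoRA with factor 4.
rows = lookup(table, offsets, positions=[2, 8], factors=[1, 4])
```

Because every token carries its own offset into the shared table, tokens with different scaling factors can be handled in the same batch instead of one kernel call per LoRA.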

17 May, 2024 2 commits
- Sync huggingface modifications of qwen Moe model (#4774) · 48d5985a
  eigenLiu authored May 18, 2024
  
  48d5985a
- [Bugfix] fix rope error when load models with different dtypes (#4835) · 33e0823d
  Jinzhen Lin authored May 17, 2024
  
  33e0823d
16 May, 2024 4 commits
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
  5c342570
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 1 commit

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1

SangBin Cho authored May 15, 2024

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing it when running the attn backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.

65bf2ac1
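The message of #4681 above describes a single preparation API and one metadata class that can be sliced into its prefill and decode parts. A rough sketch of that shape is shown here. The class `AttnMetadata`, its fields, and `prepare_inputs` are hypothetical stand-ins chosen for illustration, not the identifiers used in the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttnMetadata:
    """Hypothetical combined metadata: prefill sequences first, then decode."""
    seq_lens: List[int]
    num_prefills: int

    @property
    def prefill(self) -> Optional["AttnMetadata"]:
        # Slice out the prefill part, or None when the batch has no prefills.
        if self.num_prefills == 0:
            return None
        return AttnMetadata(self.seq_lens[: self.num_prefills], self.num_prefills)

    @property
    def decode(self) -> Optional["AttnMetadata"]:
        # Slice out the decode part, or None when the batch has no decodes.
        rest = self.seq_lens[self.num_prefills:]
        if not rest:
            return None
        return AttnMetadata(rest, 0)


def prepare_inputs(prompt_lens, decode_context_lens):
    """Single entry point replacing separate prompt/decode preparation (sketch)."""
    return AttnMetadata(
        seq_lens=list(prompt_lens) + list(decode_context_lens),
        num_prefills=len(prompt_lens),
    )


meta = prepare_inputs(prompt_lens=[5, 7], decode_context_lens=[12])
```

An attention backend could then run its prefill path on `meta.prefill` and its decode path on `meta.decode`, skipping whichever is `None`.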

13 May, 2024 4 commits
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) · 8bc68e19
  Sanger Steel authored May 13, 2024
  
  8bc68e19
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
- [CORE] Improvement in ranks code (#4718) · a7be4d00
  Swapnil Parekh authored May 12, 2024
  
  a7be4d00
12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
  
  6eaccb73
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 1 commit

[Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172

SangBin Cho authored May 10, 2024

Storing an exception frame is extremely prone to circular references because the frame holds references to objects.

When tensorizer is not installed, the llm instance leaks because the error frame has references to various modules, which causes a circular reference problem.

I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.

6a0f6172
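The message of #4737 above mentions breaking a circular reference with `weakref.proxy`. The sketch below shows the general technique on two hypothetical classes, `Engine` and `Worker`, which are not taken from the vLLM code base. It relies on CPython reference counting to free the object as soon as the last strong reference disappears.

```python
import gc
import weakref


class Worker:
    def __init__(self, engine):
        # A weak proxy lets the worker call back into the engine
        # without holding a strong reference to it.
        self.engine = weakref.proxy(engine)


class Engine:
    def __init__(self):
        # Engine -> Worker is a strong reference; Worker -> Engine is weak,
        # so the two objects do not form a reference cycle.
        self.worker = Worker(self)


def engine_is_freed_without_gc():
    """Return True if the engine is released by reference counting alone."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector doing the cleanup
    try:
        engine = Engine()
        ref = weakref.ref(engine)
        del engine
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()


freed = engine_is_freed_without_gc()
```

If `Worker` stored `engine` directly, the pair would form a cycle and would stay alive until the cycle collector ran.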

09 May, 2024 1 commit

[Kernel] [FP8] Improve FP8 linear layer performance (#4691) · 379da6dc

Philipp Moritz authored May 09, 2024

This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).

We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
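The enlargement described above can be pictured as padding the input with zero rows before the matrix multiply and dropping the extra output rows afterwards. The plain-Python sketch below only illustrates that shape handling. The constant `MIN_ROWS = 17` (the smallest count greater than 16) and the helper names are assumptions, and the real change operates on GPU tensors during quantization rather than on Python lists.

```python
MIN_ROWS = 17  # assumption: smallest row count that is "greater than 16"


def pad_rows(matrix, min_rows=MIN_ROWS):
    """Append zero rows until the matrix has at least `min_rows` rows."""
    if not matrix:
        raise ValueError("expected at least one row")
    rows, width = len(matrix), len(matrix[0])
    if rows >= min_rows:
        return matrix, rows
    padding = [[0.0] * width for _ in range(min_rows - rows)]
    return matrix + padding, rows


def matmul(a, b):
    """Naive dense matrix multiply, standing in for the GPU GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def padded_linear(x, weight):
    """Multiply the padded input, then keep only the original rows."""
    padded, rows = pad_rows(x)
    return matmul(padded, weight)[:rows]


x = [[1.0, 2.0]]                             # a single-token batch
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
out = padded_linear(x, weight)
```

The zero rows do not change the retained output rows; they only change the problem size seen by the GEMM routine.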

Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
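
To make the figures above easier to compare, the following snippet computes the relative ITL reduction of this PR against previous main and against FP16. The latencies are copied from the list above; the dictionary layout and the helper name are just one way to arrange them.

```python
# ITL in ms per qps: (FP8 this PR, FP8 previous main, FP16), from the list above.
ITL_MS = {
    1: (24, 32, 26),
    2: (26, 34, 28),
    4: (33, 44, 36),
    6: (46, 56, 54),
    8: (85, 85, 138),
}


def reduction_pct(new, old):
    """Percentage by which `new` latency is lower than `old`."""
    return 100.0 * (old - new) / old


summary = {
    qps: (round(reduction_pct(pr, main), 1), round(reduction_pct(pr, fp16), 1))
    for qps, (pr, main, fp16) in ITL_MS.items()
}

for qps, (vs_main, vs_fp16) in summary.items():
    print(f"qps = {qps}: {vs_main}% lower than previous main, {vs_fp16}% lower than FP16")
```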

379da6dc