Commits · 6287537a0c970bda1fc8b31f2bde1bcf2d26e151 · OpenDAS / vllm_cscc

20 May, 2024 2 commits
- [Model] LLaVA model refactor (#4910) · 6287537a
  Cyrus Leung authored May 20, 2024
  
  6287537a
- [Kernel] Add flash-attn back (#4907) · b57e6c59
  Woosuk Kwon authored May 19, 2024
  
  b57e6c59
19 May, 2024 2 commits
- [Kernel] Add marlin_24 unit tests (#4901) · 27ce8547
  Alexander Matveev authored May 19, 2024
  
  27ce8547
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
  f68470e8
18 May, 2024 2 commits

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files
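
The idea of one cos-sin cache per scaling factor, served by a single batched lookup, can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from the PR: it assumes linear position scaling, and the names `build_cos_sin_cache`, `build_batched_cache`, and `lookup` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Row = Tuple[List[float], List[float]]  # (cos values, sin values) for one position


def build_cos_sin_cache(rotary_dim: int, max_pos: int, base: float,
                        scaling_factor: float) -> List[Row]:
    """Cos/sin table for one scaling factor (assumes linear scaling)."""
    inv_freq = [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]
    rows = []
    # A larger scaling factor stretches the usable context length.
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor
        rows.append(([math.cos(t * f) for f in inv_freq],
                     [math.sin(t * f) for f in inv_freq]))
    return rows


def build_batched_cache(rotary_dim: int, max_pos: int, base: float,
                        scaling_factors: List[float]
                        ) -> Tuple[List[Row], Dict[float, int]]:
    """Concatenate the per-factor tables and record where each one starts."""
    cache: List[Row] = []
    offsets: Dict[float, int] = {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        cache.extend(build_cos_sin_cache(rotary_dim, max_pos, base, factor))
    return cache, offsets


def lookup(cache: List[Row], positions: List[int],
           token_offsets: List[int]) -> List[Row]:
    """One pass over the batch: each token indexes its own table via its offset."""
    return [cache[pos + off] for pos, off in zip(positions, token_offsets)]
```

With tables for scaling factors 1.0 and 4.0 concatenated, tokens that belong to different LoRAs can share one call: each token carries the offset of its own table, so no per-LoRA kernel launch is needed.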

2e9a2227

[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used (#4658) · c0724fc9
alexeykondrat authored May 18, 2024

c0724fc9

17 May, 2024 4 commits
- Sync huggingface modifications of qwen Moe model (#4774) · 48d5985a
  eigenLiu authored May 18, 2024
  
  48d5985a
- [Bugfix] fix rope error when load models with different dtypes (#4835) · 33e0823d
  Jinzhen Lin authored May 17, 2024
  
  33e0823d
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness,... · 26148120
  Alexei-V-Ivanov-AMD authored May 16, 2024
```
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
```
  26148120
- [Frontend] OpenAI API server: Do not add bos token by default when encoding (#4688) · 0150a106
  bofeng huang authored May 17, 2024
  
  0150a106
16 May, 2024 12 commits
- [Bugfix] Fix FP8 KV cache support (#4869) · 9a31a817
  Woosuk Kwon authored May 16, 2024
  
  9a31a817
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- [Misc] remove old comments (#4866) · 10fa9eea
  youkaichao authored May 16, 2024
  
  10fa9eea
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
  
  e0818808
- [ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845) · b5853f99
  Hongxia Yang authored May 16, 2024
  
  b5853f99
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Bugfix] Bypass authorization API token for preflight requests (#4862) · 9216b9cc
  Pierre Dulac authored May 16, 2024
  
  9216b9cc
- [Frontend] Separate OpenAI Batch Runner usage from API Server (#4851) · 5e0391c0
  Alex Wu authored May 16, 2024
  
  5e0391c0
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
  5c342570
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 5 commits

[Frontend] Support OpenAI batch file format (#4794) · 52f8107c
Alex Wu authored May 15, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
52f8107c
[Frontend] Re-enable custom roles in Chat Completions API (#4758) · fc0d9dfc
Cyrus Leung authored May 16, 2024

fc0d9dfc
[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34
zifeitong authored May 15, 2024

a5675d34

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1

SangBin Cho authored May 15, 2024

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.
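
A rough sketch of what a single metadata class with prefill/decode slicing could look like is given below. It is an illustration under assumptions, not the class from the PR: `AttnMetadata` here is a simplified stand-in, `prepare_model_input`, `prefill_view`, and `decode_view` are hypothetical names, and the layout (all prefill tokens first, decode tokens after) is assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttnMetadata:
    """Hypothetical merged metadata: prefill tokens first, decode tokens after."""
    slot_mapping: List[int]
    num_prefill_tokens: int
    num_decode_tokens: int

    def prefill_view(self) -> "AttnMetadata":
        # Slice out only the prefill part of the batch.
        n = self.num_prefill_tokens
        return AttnMetadata(self.slot_mapping[:n], n, 0)

    def decode_view(self) -> "AttnMetadata":
        # Slice out only the decode part of the batch.
        n = self.num_prefill_tokens
        return AttnMetadata(self.slot_mapping[n:], 0, self.num_decode_tokens)


def prepare_model_input(prefill_slots: List[List[int]],
                        decode_slots: List[int]) -> AttnMetadata:
    """Single entry point that builds one metadata object for the whole batch."""
    flat_prefill = [slot for seq in prefill_slots for slot in seq]
    return AttnMetadata(flat_prefill + decode_slots,
                        len(flat_prefill), len(decode_slots))
```

An attention backend could then call `prefill_view()` and `decode_view()` on the same object and run the appropriate kernel on each slice.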

65bf2ac1

Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254

SangBin Cho authored May 15, 2024

LoRA tests 3 and 4 seem to hit an illegal memory access failure after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered

Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df53.

8a7cc254

14 May, 2024 2 commits
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by... · ccb63a82
  Kuntai Du authored May 14, 2024
```
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696)
```
  ccb63a82
13 May, 2024 9 commits

[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
Philipp Moritz authored May 13, 2024

33d3914b

[Kernel] Use flash-attn for decoding (#3648) · 1356df53

Stephen Krider authored May 13, 2024


Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>

1356df53

[Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
Cody Yu authored May 13, 2024

ce532ff4
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update... · 8bc68e19
Sanger Steel authored May 13, 2024
```
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
```
8bc68e19
[Misc] Enhance attention selector (#4751) · 0fca3cdc
Woosuk Kwon authored May 13, 2024

0fca3cdc
[Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
e7c46b95

[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) · 350f9e10

Cyrus Leung authored May 13, 2024

Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as the one in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported with a relative import.

350f9e10

[Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) · 702bee46
youkaichao authored May 12, 2024

702bee46
[CORE] Improvement in ranks code (#4718) · a7be4d00
Swapnil Parekh authored May 12, 2024

a7be4d00

12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
  
  6eaccb73
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b