Commits · 9a31a817a85ac4249bf82dd8b6f90ef6b8e81fef · OpenDAS / vllm_cscc

16 May, 2024 15 commits
- [Bugfix] Fix FP8 KV cache support (#4869) · 9a31a817
  Woosuk Kwon authored May 16, 2024
  
  9a31a817
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) · 8435b207
  Silencio authored May 17, 2024
```
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
```
  8435b207
- [Misc] remove old comments (#4866) · 10fa9eea
  youkaichao authored May 16, 2024
  
  10fa9eea
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
  
  e0818808
- [ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845) · b5853f99
  Hongxia Yang authored May 16, 2024
  
  b5853f99
- Add JSON output support for benchmark_latency and benchmark_throughput (#4848) · f09edd8a
  Simon Mo authored May 16, 2024
  
  f09edd8a
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Bugfix] Bypass authorization API token for preflight requests (#4862) · 9216b9cc
  Pierre Dulac authored May 16, 2024
  
  9216b9cc
- [Frontend] Separate OpenAI Batch Runner usage from API Server (#4851) · 5e0391c0
  Alex Wu authored May 16, 2024
  
  5e0391c0
- [docs] Fix typo in examples filename openi -> openai (#4864) · dbc0754d
  Alex Wu authored May 16, 2024
  
  dbc0754d
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
  5c342570
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 7 commits

- [Frontend] Support OpenAI batch file format (#4794) · 52f8107c
  Alex Wu authored May 15, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  52f8107c
- [Frontend] Re-enable custom roles in Chat Completions API (#4758) · fc0d9dfc
  Cyrus Leung authored May 16, 2024
  
  fc0d9dfc
- [Doc] Highlight the fourth meetup in the README (#4842) · 361c461a
  Zhuohan Li authored May 15, 2024
  
  361c461a
- [Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34
  zifeitong authored May 15, 2024
  
  a5675d34
- [CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166) · e9cdd2b1
  Cyrus Leung authored May 15, 2024
  
  e9cdd2b1

- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1
  SangBin Cho authored May 15, 2024

  [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

  This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows it to be sliced when running the attn backend.

  It also refactors subquery_start_loc, which was not refactored in the previous PR.

  65bf2ac1

- Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254
  SangBin Cho authored May 15, 2024

  The LoRA 3 & 4 tests seem to hit an illegal memory access failure after this commit:

  [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered

  Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

  This reverts commit 1356df53.

  8a7cc254
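
As a rough illustration of the refactoring described in #4681 (65bf2ac1), the sketch below shows one way a single preparation call could build combined prefill/decode metadata that is then sliced per phase. It is not vLLM's implementation: `BatchAttnMetadata`, `prepare_batch`, `prefill_slice`, and `decode_slice` are hypothetical names, and the layout (prefill sequences first, one new token per decoding sequence) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BatchAttnMetadata:
    """Hypothetical combined metadata for a mixed prefill/decode batch."""
    num_prefills: int          # number of prefill sequences (assumed to come first)
    num_prefill_tokens: int    # total prompt tokens across prefill sequences
    num_decode_tokens: int     # assumed: one new token per decoding sequence
    seq_lens: List[int]        # per-sequence lengths, prefills before decodes

    def prefill_slice(self) -> Optional["BatchAttnMetadata"]:
        """Return metadata covering only the prefill part, or None if empty."""
        if self.num_prefills == 0:
            return None
        return BatchAttnMetadata(
            num_prefills=self.num_prefills,
            num_prefill_tokens=self.num_prefill_tokens,
            num_decode_tokens=0,
            seq_lens=self.seq_lens[: self.num_prefills],
        )

    def decode_slice(self) -> Optional["BatchAttnMetadata"]:
        """Return metadata covering only the decode part, or None if empty."""
        if self.num_decode_tokens == 0:
            return None
        return BatchAttnMetadata(
            num_prefills=0,
            num_prefill_tokens=0,
            num_decode_tokens=self.num_decode_tokens,
            seq_lens=self.seq_lens[self.num_prefills:],
        )


def prepare_batch(prompt_lens: List[int],
                  decode_context_lens: List[int]) -> BatchAttnMetadata:
    """Hypothetical single entry point replacing separate prompt/decode prep."""
    return BatchAttnMetadata(
        num_prefills=len(prompt_lens),
        num_prefill_tokens=sum(prompt_lens),
        num_decode_tokens=len(decode_context_lens),
        seq_lens=list(prompt_lens) + list(decode_context_lens),
    )
```

An attention backend in this sketch would call `prefill_slice()` and `decode_slice()` on the one metadata object and skip whichever phase returns `None`.
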

14 May, 2024 6 commits
- Add 4th meetup announcement to readme (#4817) · 29bc01bf
  Simon Mo authored May 14, 2024
  
  29bc01bf
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
- [Bugfix][Doc] Fix CI failure in docs (#4804) · dc72402b
  Cyrus Leung authored May 15, 2024
```
This PR fixes the CI failure introduced by #4798.

The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.

I have also changed the format of the links to be more distinct from each other.
```
  dc72402b
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by... · ccb63a82
  Kuntai Du authored May 14, 2024
```
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696)
```
  ccb63a82
- [Doc] Add meetups to the doc (#4798) · c579b750
  Zhuohan Li authored May 13, 2024
  
  c579b750
- [Doc] Add API reference for offline inference (#4710) · 4bfa7e7f
  Cyrus Leung authored May 14, 2024
  
  4bfa7e7f
13 May, 2024 11 commits
- [Doc] Shorten README by removing supported model list (#4796) · ac1fbf7f
  Zhuohan Li authored May 13, 2024
  
  ac1fbf7f
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Kernel] Use flash-attn for decoding (#3648) · 1356df53
  Stephen Krider authored May 13, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
```
  1356df53
- [Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
  Cody Yu authored May 13, 2024
  
  ce532ff4
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update... · 8bc68e19
  Sanger Steel authored May 13, 2024
```
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
```
  8bc68e19
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
- [Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
  SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  e7c46b95
- [CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) · 350f9e10
  Cyrus Leung authored May 13, 2024
```
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)

Also, I have moved the test utilities file (test_utils.py) into the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported with a relative import.
```
  350f9e10
- [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) · 702bee46
  youkaichao authored May 12, 2024
  
  702bee46
- [CORE] Improvement in ranks code (#4718) · a7be4d00
  Swapnil Parekh authored May 12, 2024
  
  a7be4d00
- [CI/Build] Tweak Marlin Nondeterminism Issues (#4713) · a709e87a
  Robert Shaw authored May 12, 2024
  
  a709e87a
12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
  
  6eaccb73