Commits · 8a7cc254a064b8d42bf4de7a9c3f29552240dfd9 · OpenDAS / vllm_cscc

15 May, 2024 1 commit

Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254

SangBin Cho authored May 15, 2024

LoRA tests 3 and 4 appear to fail with an illegal memory access after the commit being reverted:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df53.

8a7cc254

14 May, 2024 6 commits
- Add 4th meetup announcement to readme (#4817) · 29bc01bf
  Simon Mo authored May 14, 2024
  
  29bc01bf
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
- [Bugfix][Doc] Fix CI failure in docs (#4804) · dc72402b
  Cyrus Leung authored May 15, 2024
```
This PR fixes the CI failure introduced by #4798.

The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.

I have also changed the format of the links to be more distinct from each other.
```
  dc72402b
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by... · ccb63a82
  Kuntai Du authored May 14, 2024
```
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696)
```
  ccb63a82
- [Doc] Add meetups to the doc (#4798) · c579b750
  Zhuohan Li authored May 13, 2024
  
  c579b750
- [Doc] Add API reference for offline inference (#4710) · 4bfa7e7f
  Cyrus Leung authored May 14, 2024
  
  4bfa7e7f
13 May, 2024 11 commits
- [Doc] Shorten README by removing supported model list (#4796) · ac1fbf7f
  Zhuohan Li authored May 13, 2024
  
  ac1fbf7f
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Kernel] Use flash-attn for decoding (#3648) · 1356df53
  Stephen Krider authored May 13, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
```
  1356df53
- [Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
  Cody Yu authored May 13, 2024
  
  ce532ff4
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update... · 8bc68e19
  Sanger Steel authored May 13, 2024
```
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
```
  8bc68e19
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
- [Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
  SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  e7c46b95
- [CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) · 350f9e10
  Cyrus Leung authored May 13, 2024
```
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported via relative imports.
```
  350f9e10
- [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) · 702bee46
  youkaichao authored May 12, 2024
  
  702bee46
- [CORE] Improvement in ranks code (#4718) · a7be4d00
  Swapnil Parekh authored May 12, 2024
  
  a7be4d00
- [CI/Build] Tweak Marlin Nondeterminism Issues (#4713) · a709e87a
  Robert Shaw authored May 12, 2024
  
  a709e87a
12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
  
  6eaccb73
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 11 commits
- [Core][Test] fix function name typo in custom allreduce (#4750) · 4e121310
  youkaichao authored May 10, 2024
  
  4e121310
- [CI] Nits for bad initialization of SeqGroup in testing (#4748) · fcc2994b
  Robert Shaw authored May 10, 2024
  
  fcc2994b
- [Speculative decoding] CUDA graph support (#4295) · 2e7796f2
  heeju-kim2 authored May 11, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  2e7796f2
- [Bugfix] Fix CLI arguments in OpenAI server docs (#4729) · 706588a7
  Allen.Dou authored May 10, 2024
  
  706588a7
- [Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172
  SangBin Cho authored May 10, 2024
```
Storing an exception frame is extremely prone to circular references because the frame holds references to objects.

When tensorizer is not installed, the llm instance leaks because the error frame holds references to various modules, which creates a circular reference.

I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
```
  6a0f6172
- [Misc] Apply a couple g++ cleanups (#4719) · dac6a3f6
  Steve Grubb authored May 10, 2024
  
  dac6a3f6
- [Core]fix type annotation for `swap_blocks` (#4726) · 64b77dfd
  Kunshang Ji authored May 10, 2024
  
  64b77dfd
- chunked-prefill-doc-syntax (#4603) · 51d4094f
  Simon Mo authored May 09, 2024
```
Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html

Co-authored-by: sang <rkooo567@gmail.com>
```
  51d4094f
- [Misc] Keep only one implementation of the create_dummy_prompt function. (#4716) · e965d461
  Allen.Dou authored May 10, 2024
  
  e965d461
- [Core][Distributed] refactor pynccl (#4591) · 208b71bc
  youkaichao authored May 09, 2024
```
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
```
  208b71bc
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
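The weakref.proxy approach mentioned in #4737 above can be illustrated with a minimal sketch. The `Engine` and `Worker` classes here are hypothetical stand-ins, not vLLM classes, and the check relies on CPython's reference counting; it shows only the general pattern of a child holding a weak proxy to its owner so that no reference cycle is formed.

```python
import gc
import weakref


class Engine:
    """Hypothetical owner object (stand-in for a long-lived instance)."""

    name = "engine"

    def __init__(self):
        # The owner holds a strong reference to its child.
        self.worker = Worker(self)


class Worker:
    """Hypothetical child that needs to call back into its owner."""

    def __init__(self, engine):
        # weakref.proxy forwards attribute access without owning the engine,
        # so Engine -> Worker -> Engine does not become a strong cycle.
        self.engine = weakref.proxy(engine)

    def owner_name(self):
        return self.engine.name


def is_freed_without_gc():
    """Return True if dropping the last strong reference frees the owner."""
    gc.disable()  # rule out the cycle collector; rely on refcounting only
    try:
        engine = Engine()
        probe = weakref.ref(engine)
        del engine
        return probe() is None
    finally:
        gc.enable()
```

If `Worker` stored `engine` directly instead of a proxy, the owner would stay alive after `del engine` until the cycle collector ran.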
09 May, 2024 9 commits

[Kernel] [FP8] Improve FP8 linear layer performance (#4691) · 379da6dc

Philipp Moritz authored May 09, 2024

This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).

We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
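As a rough illustration of the enlarging idea, here is a pure-Python sketch. It assumes the enlargement is done by appending zero rows and slicing the result back afterwards; the commit message does not say how the PR pads, and the names `MIN_ROWS`, `pad_rows`, and `padded_matmul` are hypothetical. A real implementation would operate on GPU tensors rather than lists.

```python
MIN_ROWS = 17  # assumption: smallest row count that is "greater than 16"


def pad_rows(matrix, min_rows=MIN_ROWS):
    """Return (padded_matrix, original_row_count), padding with zero rows."""
    n_rows = len(matrix)
    if n_rows >= min_rows:
        return matrix, n_rows
    n_cols = len(matrix[0]) if matrix else 0
    padding = [[0.0] * n_cols for _ in range(min_rows - n_rows)]
    return matrix + padding, n_rows


def matmul(a, b):
    """Plain list-of-lists matrix product (stand-in for the GPU GEMM)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def padded_matmul(a, b):
    """Enlarge the first dimension, multiply, then drop the padded rows."""
    padded, n_rows = pad_rows(a)
    return matmul(padded, b)[:n_rows]
```

Zero rows only add zero rows to the product, so slicing back to the original row count returns the same result as the unpadded multiplication.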

Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
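
To read the numbers above as relative changes, the following small sketch computes the fractional ITL reduction of FP8 with this PR against both baselines. The figures are copied from the list above; the helper names `reduction` and `summarize` are illustrative only.

```python
# ITL in ms per qps: (FP8 this PR, FP8 previous main, FP16)
ITL_MS = {
    1: (24, 32, 26),
    2: (26, 34, 28),
    4: (33, 44, 36),
    6: (46, 56, 54),
    8: (85, 85, 138),
}


def reduction(new_ms, old_ms):
    """Fractional latency reduction of new_ms relative to old_ms."""
    return (old_ms - new_ms) / old_ms


def summarize(itl=ITL_MS):
    """Map qps -> (reduction vs previous FP8, reduction vs FP16)."""
    return {
        qps: (reduction(fp8_new, fp8_old), reduction(fp8_new, fp16))
        for qps, (fp8_new, fp8_old, fp16) in itl.items()
    }


if __name__ == "__main__":
    for qps, (vs_old, vs_fp16) in summarize().items():
        print(f"qps={qps}: {vs_old:.1%} vs previous FP8, {vs_fp16:.1%} vs FP16")
```

For example, at qps = 1 the FP8 latency drops from 32 ms to 24 ms, a 25% reduction, while at qps = 8 it is unchanged against previous main.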

379da6dc

[Model] Snowflake arctic model implementation (#4652) · ebce310b

Hao Zhang authored May 09, 2024


Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

ebce310b

[Bugfix] Add logs for all model dtype casting (#4717) · be0c5180
Michael Goin authored May 09, 2024

be0c5180
[Bugfix] Update grafana.json (#4711) · cea64430
Robert Shaw authored May 09, 2024

cea64430
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709) · a3c12457
Cyrus Leung authored May 10, 2024

a3c12457
[ROCm] Add support for Punica kernels on AMD GPUs (#3140) · ff5abcd7
kliuae authored May 10, 2024
```
Co-authored-by: miloice <jeffaw99@hotmail.com>
```
ff5abcd7
[Misc] Set block size at initialization & Fix test_model_runner (#4705) · 0ee535b2
Woosuk Kwon authored May 09, 2024

0ee535b2
[Misc] Remove unnecessary ModelRunner imports (#4703) · 190bc838
Woosuk Kwon authored May 09, 2024

190bc838
[Frontend] Move async logic outside of constructor (#4674) · f12b20de
Cyrus Leung authored May 09, 2024

f12b20de