Commits · 8a7cc254a064b8d42bf4de7a9c3f29552240dfd9 · OpenDAS / vllm_cscc

15 May, 2024 1 commit

Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254

SangBin Cho authored May 15, 2024

The LoRA 3 and 4 tests appear to fail with an illegal memory access after the reverted commit (#3648):

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df53.


8a7cc254

14 May, 2024 1 commit
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
13 May, 2024 8 commits

[Kernel] Use flash-attn for decoding (#3648) · 1356df53

Stephen Krider authored May 13, 2024


Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>

1356df53

[Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
Cody Yu authored May 13, 2024

ce532ff4
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update... · 8bc68e19
Sanger Steel authored May 13, 2024
```
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
```
8bc68e19
[Misc] Enhance attention selector (#4751) · 0fca3cdc
Woosuk Kwon authored May 13, 2024

0fca3cdc
[Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
e7c46b95

[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) · 350f9e10

Cyrus Leung authored May 13, 2024

Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as the one in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

I have also moved the test utilities file (test_utils.py) into the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported with a relative import.

350f9e10

[Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) · 702bee46
youkaichao authored May 12, 2024

702bee46
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713) · a709e87a
Robert Shaw authored May 12, 2024

a709e87a

11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 7 commits
- [Core][Test] fix function name typo in custom allreduce (#4750) · 4e121310
  youkaichao authored May 10, 2024
  
  4e121310
- [CI] Nits for bad initialization of SeqGroup in testing (#4748) · fcc2994b
  Robert Shaw authored May 10, 2024
  
  fcc2994b
- [Speculative decoding] CUDA graph support (#4295) · 2e7796f2
  heeju-kim2 authored May 11, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  2e7796f2
- [Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172
  SangBin Cho authored May 10, 2024
```
Storing an exception frame is extremely prone to circular references because the frame holds references to objects.

When tensorizer is not installed, the llm instance leaks because the error frame holds references to various modules, which causes a circular reference.

I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
```
  6a0f6172
- [Misc] Keep only one implementation of the create_dummy_prompt function. (#4716) · e965d461
  Allen.Dou authored May 10, 2024
  
  e965d461
- [Core][Distributed] refactor pynccl (#4591) · 208b71bc
  youkaichao authored May 09, 2024
```
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
```
  208b71bc
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
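
The circular-reference problem described in #4737 above can be reproduced in a few lines. The following is a minimal sketch, not the vLLM code: `Engine`, `Worker`, and the helper functions are hypothetical names chosen for illustration. It assumes CPython, where reference counting frees objects immediately unless they sit in a reference cycle. The cyclic garbage collector is switched off during each check so that only reference counting is observed.

```python
import gc
import weakref


class Engine:
    """Hypothetical stand-in for a heavyweight object such as an LLM instance."""


class Worker:
    """Hypothetical helper that keeps a back-reference to its engine."""

    def __init__(self, engine):
        self.engine = engine


def remember_error(engine):
    # `engine` is a local of this frame. Keeping the exception in another
    # local builds a cycle: frame -> saved -> traceback -> frame.
    try:
        raise ImportError("optional dependency is not installed")
    except ImportError as exc:
        saved = exc  # kept on purpose to build the cycle


def remember_message(engine):
    # Keeping only the message lets the exception and its traceback be freed.
    try:
        raise ImportError("optional dependency is not installed")
    except ImportError as exc:
        saved = str(exc)


def attach_strong(engine):
    # engine -> worker -> engine is a reference cycle.
    engine.worker = Worker(engine)


def attach_proxy(engine):
    # The proxy does not keep the engine alive, so no cycle is formed.
    engine.worker = Worker(weakref.proxy(engine))


def freed_by_refcount(use):
    """Return True if an Engine passed to `use` is freed without the cyclic GC."""
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        engine = Engine()
        ref = weakref.ref(engine)
        use(engine)
        del engine
        return ref() is None
    finally:
        if gc_was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle created by the demo
```

Calling `freed_by_refcount` with `remember_error` or `attach_strong` reports that the engine is still alive after its last ordinary reference is dropped, while `remember_message` and `attach_proxy` let it be reclaimed right away. One trade-off of `weakref.proxy` is that using the proxy after the referent is gone raises `ReferenceError`, so the owner must outlive the objects that hold the proxy.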
09 May, 2024 3 commits
- [Misc] Set block size at initialization & Fix test_model_runner (#4705) · 0ee535b2
  Woosuk Kwon authored May 09, 2024
  
  0ee535b2
- [Misc] Remove unnecessary ModelRunner imports (#4703) · 190bc838
  Woosuk Kwon authored May 09, 2024
  
  190bc838
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
  f12b20de
08 May, 2024 6 commits
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  f942efb5
- [CI/Test] fix swap test for multi gpu (#4689) · 230c4b38
  youkaichao authored May 08, 2024
  
  230c4b38
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) · 0f9a6e3d
  DefTruth authored May 09, 2024
  
  0f9a6e3d
- [CI] Make mistral tests pass (#4596) · f6a59309
  SangBin Cho authored May 09, 2024
  
  f6a59309
- [Core][Distributed] support cpu&device in broadcast tensor dict (#4660) · cc466a32
  youkaichao authored May 07, 2024
```
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
```
  cc466a32
07 May, 2024 3 commits
- [Bug fix][Core] fixup ngram not setup correctly (#4551) · 8344f774
  leiwen83 authored May 08, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  8344f774
- [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
  youkaichao authored May 07, 2024
  
  469f85c7
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 3 commits

[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
DearPlanet authored May 05, 2024

43029870

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011

Michael Goin authored May 04, 2024

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-on to #4332 that enables FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The Expert Gate/Router always runs at half / full precision for now.
- If the weight scales differ within the QKV layer (for separate Q, K, and V weights), the weights are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.

2a052011
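
The re-quantization note in #4527 above can be illustrated with plain arithmetic. This is a rough sketch under stated assumptions, not the vLLM implementation: the function name and the `(values, scale)` data layout are hypothetical, each shard is assumed to use one per-tensor scale so that the real weight is approximately `value * scale`, and the result is left as Python floats without modelling FP8 rounding or clamping.

```python
def requantize_to_shared_scale(shards):
    """Re-express per-shard quantized weights in units of one shared scale.

    `shards` is a list of (quantized_values, scale) pairs, for example one
    pair each for the Q, K, and V weights (hypothetical layout).
    """
    # Using the largest scale keeps every re-quantized magnitude no larger
    # than the original quantized magnitude.
    shared_scale = max(scale for _, scale in shards)
    requantized = []
    for values, scale in shards:
        # Dequantize with the shard's own scale, then divide by the shared one.
        requantized.append([v * scale / shared_scale for v in values])
    return requantized, shared_scale


# Example: three shards with different per-tensor scales.
shards = [([8.0, -4.0], 0.5), ([2.0], 1.0), ([16.0], 0.25)]
requantized, shared_scale = requantize_to_shared_scale(shards)
```

With a single scale for all shards, the concatenated weights can be multiplied in one matrix product and rescaled once, which is the motivation the note gives. The cost is that shards with small original scales lose some resolution once they are expressed in the larger shared scale.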

[Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
Cody Yu authored May 03, 2024

bc8ad684

03 May, 2024 5 commits
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
  ab502751
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
  f8e7adda
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
  344a5d0c
02 May, 2024 2 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
  0f8a9140
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
  32881f3f