Commits · 1a8bfd92d5f35d638e3cfc8c4cd1779aeda0adfb · OpenDAS / vllm_cscc

12 Jun, 2024 1 commit
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
09 Jun, 2024 1 commit
- [Core][CUDA Graph] add output buffer for cudagraph (#5074) · 0373e183
  youkaichao authored Jun 08, 2024
```
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)
```
  0373e183
07 Jun, 2024 1 commit
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
04 Jun, 2024 1 commit
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
  
  06b2550c
03 Jun, 2024 1 commit
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
30 May, 2024 1 commit
- [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129) · d79d9eaa
  Hyunsung Lee authored May 30, 2024
  
  d79d9eaa
29 May, 2024 1 commit
- [Core][Optimization] remove vllm-nccl (#5091) · 5bd3c650
  youkaichao authored May 28, 2024
  
  5bd3c650
28 May, 2024 2 commits
- [BugFix] Fix Embedding Models with TP>1 (#5075) · 9ba41558
  Robert Shaw authored May 28, 2024
  
  9ba41558
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
22 May, 2024 2 commits
- [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) · eb6d3c26
  Nick Hill authored May 22, 2024
  
  eb6d3c26
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
```
  a3a73ab0
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently the rotary embedding kernel has to be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
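
The idea of one cos-sin cache per scaling factor can be sketched in plain Python. This is only an illustration under stated assumptions, not the kernel from the PR: all function names are hypothetical, linear position scaling is assumed, and the rotation uses the "first half / second half" layout. The caches are concatenated into one table with a per-factor row offset, so a mixed batch needs a single lookup pass instead of one call per LoRA.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, base, scaling_factor):
    """One (cos, sin) row per position, assuming linear position scaling."""
    inv_freq = [base ** (-2 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor  # scaled position
        cache.append(([math.cos(t * f) for f in inv_freq],
                      [math.sin(t * f) for f in inv_freq]))
    return cache


def build_batched_cache(max_pos, rot_dim, base, scaling_factors):
    """Concatenate the per-factor caches and record each factor's row offset."""
    merged, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(merged)
        merged.extend(build_cos_sin_cache(max_pos, rot_dim, base, factor))
    return merged, offsets


def rotate(vec, cos, sin):
    """Rotate a vector stored as [first half | second half]."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def batched_rotary(queries, positions, factors, merged, offsets):
    """Single pass over a batch whose tokens may use different scaling factors."""
    out = []
    for q, pos, factor in zip(queries, positions, factors):
        cos, sin = merged[offsets[factor] + pos]  # offset selects the cache
        out.append(rotate(q, cos, sin))
    return out
```

With scaling factors 1.0 and 2.0, position 2 under factor 2.0 maps to the same angles as position 1 under factor 1.0, which is the behaviour linear scaling is expected to give.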

2e9a2227

16 May, 2024 4 commits
- [Misc] remove old comments (#4866) · 10fa9eea
  youkaichao authored May 16, 2024
  
  10fa9eea
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
  
  e0818808
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 2 commits

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1

SangBin Cho authored May 15, 2024

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.
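
A minimal sketch of what a single metadata class with prefill/decode slicing might look like. The class, field, and function names below are hypothetical and are not taken from the vLLM code; the sketch also assumes prefill requests are placed before decode requests in the batch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchMetadata:
    """Hypothetical combined metadata: prefill entries first, then decode."""
    seq_lens: List[int]
    num_prefills: int

    @property
    def num_decodes(self) -> int:
        return len(self.seq_lens) - self.num_prefills

    def prefill_view(self) -> "BatchMetadata":
        # Slice out only the prefill requests.
        return BatchMetadata(self.seq_lens[:self.num_prefills],
                             self.num_prefills)

    def decode_view(self) -> "BatchMetadata":
        # Slice out only the decode requests.
        return BatchMetadata(self.seq_lens[self.num_prefills:], 0)


def prepare_batch(prompt_lens: List[int],
                  decode_context_lens: List[int]) -> BatchMetadata:
    """Single entry point that replaces separate prompt/decode preparation."""
    return BatchMetadata(list(prompt_lens) + list(decode_context_lens),
                         len(prompt_lens))


def run_attention(meta: BatchMetadata) -> List[Tuple[str, List[int]]]:
    """Stand-in for a backend: records which slice each phase would see."""
    calls = []
    if meta.num_prefills:
        calls.append(("prefill", meta.prefill_view().seq_lens))
    if meta.num_decodes:
        calls.append(("decode", meta.decode_view().seq_lens))
    return calls
```

Keeping one object and slicing it at the backend boundary means the batch-building code does not need separate code paths for the two phases.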

65bf2ac1

Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254

SangBin Cho authored May 15, 2024

The LoRA 3 & 4 tests seem to fail with an illegal memory access after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered

Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df53.

8a7cc254

13 May, 2024 3 commits
- [Kernel] Use flash-attn for decoding (#3648) · 1356df53
  Stephen Krider authored May 13, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
```
  1356df53
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
- [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) · 702bee46
  youkaichao authored May 12, 2024
  
  702bee46
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 1 commit
- [Core][Distributed] refactor pynccl (#4591) · 208b71bc
  youkaichao authored May 09, 2024
```
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
```
  208b71bc
09 May, 2024 1 commit
- [Misc] Set block size at initialization & Fix test_model_runner (#4705) · 0ee535b2
  Woosuk Kwon authored May 09, 2024
  
  0ee535b2
08 May, 2024 3 commits
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Core] Faster startup for LoRA enabled models (#4634) · ad932a22
  Antoni Baum authored May 08, 2024
  
  ad932a22
- [Misc] Add `get_name` method to attention backends (#4685) · 5510cf0e
  Woosuk Kwon authored May 08, 2024
  
  5510cf0e
07 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 1 commit
- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684
03 May, 2024 3 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
  344a5d0c
26 Apr, 2024 2 commits
- [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) · 603ad848
  SangBin Cho authored Apr 26, 2024
  
  603ad848
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
  a88081bf
25 Apr, 2024 2 commits
- [Core] Move function tracing setup to util function (#4352) · efffb63f
  Nick Hill authored Apr 25, 2024
  
  efffb63f
- [Mypy] Typing lora folder (#4337) · b5b4a398
  SangBin Cho authored Apr 26, 2024
  
  b5b4a398
24 Apr, 2024 1 commit
- [Core][Distributed] use cpu/gloo to initialize pynccl (#4248) · 91f50a6f
  youkaichao authored Apr 23, 2024
  
  91f50a6f
23 Apr, 2024 2 commits
- [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) · d87f39e9
  DefTruth authored Apr 24, 2024
  
  d87f39e9
- [Core] Some simplification of WorkerWrapper changes (#4183) · 8f2ea22b
  Nick Hill authored Apr 23, 2024
  
  8f2ea22b