Commits · 82079729ccd0830ce77fcc5fd7ea2be3bf81ccaf · OpenDAS / vllm_cscc

25 Jun, 2024 4 commits
- [Hardware][TPU] Raise errors for unsupported sampling params (#5850) · f178e56c
  Woosuk Kwon authored Jun 25, 2024
  
  f178e56c
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) · dd793d1d
  Matt Wong authored Jun 25, 2024
  
  dd793d1d
- [Hardware][TPU] Refactor TPU backend (#5831) · bc34937d
  Woosuk Kwon authored Jun 25, 2024
  
  bc34937d
- [Misc] Remove useless code in cpu_worker (#5824) · 7b993143
  Jie Fu (傅杰) authored Jun 26, 2024
  
  7b993143
21 Jun, 2024 2 commits
- [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) · f5dda63e
  rohithkrn authored Jun 21, 2024
  
  f5dda63e
- [Model] MLPSpeculator speculative decoding support (#4947) · b12518d3
  Joshua Rosenkranz authored Jun 20, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
```
  b12518d3
17 Jun, 2024 1 commit

[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a

Kunshang Ji authored Jun 18, 2024

Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

728c4c8a

15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
13 Jun, 2024 1 commit
- [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) · ea3890a5
  youkaichao authored Jun 12, 2024
```
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
```
  ea3890a5
12 Jun, 2024 3 commits
- [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451) · 2135cacb
  Isotr0py authored Jun 13, 2024
  
  2135cacb
- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  51602eef
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
09 Jun, 2024 1 commit
- [Core][CUDA Graph] add output buffer for cudagraph (#5074) · 0373e183
  youkaichao authored Jun 08, 2024
```
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)
```
  0373e183
07 Jun, 2024 1 commit
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
04 Jun, 2024 1 commit
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
  
  06b2550c
03 Jun, 2024 1 commit
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
30 May, 2024 1 commit
- [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129) · d79d9eaa
  Hyunsung Lee authored May 30, 2024
  
  d79d9eaa
29 May, 2024 1 commit
- [Core][Optimization] remove vllm-nccl (#5091) · 5bd3c650
  youkaichao authored May 28, 2024
  
  5bd3c650
28 May, 2024 2 commits
- [BugFix] Fix Embedding Models with TP>1 (#5075) · 9ba41558
  Robert Shaw authored May 28, 2024
  
  9ba41558
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
22 May, 2024 2 commits
- [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) · eb6d3c26
  Nick Hill authored May 22, 2024
  
  eb6d3c26
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
```
  a3a73ab0
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
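
The idea in this commit message can be illustrated with a small pure-Python sketch. It is not the kernel from the PR: every function name here is hypothetical, and it assumes linear position scaling and a half-split rotation layout. It builds one cos-sin cache per scaling factor, concatenates them, and uses a per-token offset so that a mixed batch is rotated in a single pass.

```python
import math
from typing import Dict, List, Sequence, Tuple

CosSin = Tuple[List[float], List[float]]


def build_cos_sin_cache(rot_dim: int, max_pos: int, scaling_factor: float,
                        base: float = 10000.0) -> List[CosSin]:
    """Cache for linearly scaled RoPE (assumed): position p uses p / scaling_factor."""
    inv_freq = [1.0 / base ** (2 * i / rot_dim) for i in range(rot_dim // 2)]
    cache: List[CosSin] = []
    for pos in range(max_pos):
        angles = [pos / scaling_factor * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles],
                      [math.sin(a) for a in angles]))
    return cache


def build_batched_cache(rot_dim: int, max_pos: int,
                        scaling_factors: Sequence[float]
                        ) -> Tuple[List[CosSin], Dict[float, int]]:
    """Concatenate one cache per scaling factor and record where each starts."""
    cache: List[CosSin] = []
    offsets: Dict[float, int] = {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        # A larger scaling factor covers a proportionally longer context.
        cache.extend(build_cos_sin_cache(rot_dim, int(max_pos * factor), factor))
    return cache, offsets


def rotate(x: List[float], cos: List[float], sin: List[float]) -> List[float]:
    """Rotate the pair (x1[i], x2[i]) by the i-th angle; x1, x2 are the two halves of x."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def batched_rotary(vectors: List[List[float]], positions: List[int],
                   factors: List[float], cache: List[CosSin],
                   offsets: Dict[float, int]) -> List[List[float]]:
    """One pass over the batch: each token selects its cache slice through an offset."""
    out = []
    for x, pos, factor in zip(vectors, positions, factors):
        cos, sin = cache[offsets[factor] + pos]
        out.append(rotate(x, cos, sin))
    return out


cache, offsets = build_batched_cache(rot_dim=4, max_pos=8, scaling_factors=[1.0, 2.0])
q = [1.0, 0.0, 0.0, 1.0]
# Two tokens that belong to adapters with different scaling factors, one call.
rotated = batched_rotary([q, q], positions=[1, 2], factors=[1.0, 2.0],
                         cache=cache, offsets=offsets)
```

In this sketch, position 2 under scaling factor 2.0 receives the same rotation as position 1 under factor 1.0, which is the effect linear scaling is meant to have.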

2e9a2227

16 May, 2024 4 commits
- [Misc] remove old comments (#4866) · 10fa9eea
  youkaichao authored May 16, 2024
  
  10fa9eea
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
  
  e0818808
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 2 commits

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1

SangBin Cho authored May 15, 2024

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill and decode into a single class and allows it to be sliced when running the attn backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.
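
A rough sketch of what such a combined API could look like, under stated assumptions: the class and function names below are hypothetical and not taken from the PR, and the batch is assumed to place prefill sequences before decode sequences so that each phase is a contiguous slice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BatchMetadata:
    """Hypothetical unified metadata; prefill sequences come before decode sequences."""
    seq_lens: List[int]      # context length of every sequence in the batch
    num_prefills: int        # number of leading sequences that are prefills
    num_prefill_tokens: int  # tokens contributed by those prefills
    num_decode_tokens: int   # one new token per decoding sequence

    def prefill_view(self) -> Optional["BatchMetadata"]:
        """Slice out the prefill part, or None if the batch has no prefills."""
        if self.num_prefills == 0:
            return None
        return BatchMetadata(self.seq_lens[:self.num_prefills],
                             self.num_prefills, self.num_prefill_tokens, 0)

    def decode_view(self) -> Optional["BatchMetadata"]:
        """Slice out the decode part, or None if nothing is decoding."""
        if self.num_decode_tokens == 0:
            return None
        return BatchMetadata(self.seq_lens[self.num_prefills:],
                             0, 0, self.num_decode_tokens)


def prepare_model_input(prompt_lens: List[int],
                        decode_context_lens: List[int]) -> BatchMetadata:
    """Single entry point standing in for separate prompt and decode preparation."""
    return BatchMetadata(
        seq_lens=prompt_lens + decode_context_lens,
        num_prefills=len(prompt_lens),
        num_prefill_tokens=sum(prompt_lens),
        num_decode_tokens=len(decode_context_lens),
    )


meta = prepare_model_input(prompt_lens=[5, 3], decode_context_lens=[10, 7, 2])
prefill, decode = meta.prefill_view(), meta.decode_view()
```

An attention backend written against this shape could run its prefill path on the first `num_prefill_tokens` tokens and its decode path on the remainder, without the caller building two separate metadata objects.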

65bf2ac1

Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254

SangBin Cho authored May 15, 2024

The LoRA 3 & 4 tests seem to fail with an illegal memory access after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered

Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df53.

8a7cc254

13 May, 2024 3 commits
- [Kernel] Use flash-attn for decoding (#3648) · 1356df53
  Stephen Krider authored May 13, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
```
  1356df53
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
- [Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754) · 702bee46
  youkaichao authored May 12, 2024
  
  702bee46
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 1 commit
- [Core][Distributed] refactor pynccl (#4591) · 208b71bc
  youkaichao authored May 09, 2024
```
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
```
  208b71bc
09 May, 2024 1 commit
- [Misc] Set block size at initialization & Fix test_model_runner (#4705) · 0ee535b2
  Woosuk Kwon authored May 09, 2024
  
  0ee535b2
08 May, 2024 3 commits
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Core] Faster startup for LoRA enabled models (#4634) · ad932a22
  Antoni Baum authored May 08, 2024
  
  ad932a22
- [Misc] Add `get_name` method to attention backends (#4685) · 5510cf0e
  Woosuk Kwon authored May 08, 2024
  
  5510cf0e
07 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2