Commits · b2c620230a6efdc590b06b10f8e89f42362a150a · OpenDAS / vllm_cscc

28 Jun, 2024 5 commits
- [Spec Decode] Introduce DraftModelRunner (#5799) · b2c62023
  Cody Yu authored Jun 28, 2024
  
  b2c62023
- [Hardware][Intel] OpenVINO vLLM backend (#5379) · 57f09a41
  Ilya Lavrenov authored Jun 28, 2024
  
  57f09a41
- [Core] Registry for processing model inputs (#5214) · 5cbe8d15
  Cyrus Leung authored Jun 28, 2024
```
Co-authored-by: ywang96 <ywang@roblox.com>
```
  5cbe8d15
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (#5956) · 0d0e3a42
  Isotr0py authored Jun 28, 2024
  
  0d0e3a42
- [Hardware][TPU] Optimize KV cache swapping (#5878) · f136da15
  Woosuk Kwon authored Jun 27, 2024
  
  f136da15
27 Jun, 2024 2 commits
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
  
  96354d6a
- [BugFix] Fix cuda graph for MLPSpeculator (#5875) · 2110557d
  Nick Hill authored Jun 26, 2024
```
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com>
```
  2110557d
26 Jun, 2024 4 commits
- [Bugfix][TPU] Fix CPU cache allocation (#5869) · f5c8628f
  Woosuk Kwon authored Jun 26, 2024
  
  f5c8628f
- [Hardware][TPU] Support parallel sampling & Swapping (#5855) · cbc53b6b
  Woosuk Kwon authored Jun 26, 2024
  
  cbc53b6b
- [Bugfix][TPU] Fix KV cache size calculation (#5860) · 3439c5a8
  Woosuk Kwon authored Jun 26, 2024
  
  3439c5a8
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) · dda48115
  Stephanie Wang authored Jun 25, 2024
```
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
```
  dda48115
25 Jun, 2024 4 commits
- [Hardware][TPU] Raise errors for unsupported sampling params (#5850) · f178e56c
  Woosuk Kwon authored Jun 25, 2024
  
  f178e56c
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) · dd793d1d
  Matt Wong authored Jun 25, 2024
  
  dd793d1d
- [Hardware][TPU] Refactor TPU backend (#5831) · bc34937d
  Woosuk Kwon authored Jun 25, 2024
  
  bc34937d
- [Misc] Remove useless code in cpu_worker (#5824) · 7b993143
  Jie Fu (傅杰) authored Jun 26, 2024
  
  7b993143
21 Jun, 2024 2 commits
- [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) · f5dda63e
  rohithkrn authored Jun 21, 2024
  
  f5dda63e
- [Model] MLPSpeculator speculative decoding support (#4947) · b12518d3
  Joshua Rosenkranz authored Jun 20, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
```
  b12518d3
17 Jun, 2024 1 commit
- [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a
  Kunshang Ji authored Jun 18, 2024

  Co-authored-by: Jiang Li <jiang1.li@intel.com>
  Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
  Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

  728c4c8a
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
13 Jun, 2024 1 commit
- [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) · ea3890a5
  youkaichao authored Jun 12, 2024
```
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
```
  ea3890a5
12 Jun, 2024 3 commits
- [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451) · 2135cacb
  Isotr0py authored Jun 13, 2024
  
  2135cacb
- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  51602eef
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
09 Jun, 2024 1 commit
- [Core][CUDA Graph] add output buffer for cudagraph (#5074) · 0373e183
  youkaichao authored Jun 08, 2024
```
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)
```
  0373e183
07 Jun, 2024 1 commit
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
04 Jun, 2024 1 commit
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
  
  06b2550c
03 Jun, 2024 1 commit
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
30 May, 2024 1 commit
- [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129) · d79d9eaa
  Hyunsung Lee authored May 30, 2024
  
  d79d9eaa
29 May, 2024 1 commit
- [Core][Optimization] remove vllm-nccl (#5091) · 5bd3c650
  youkaichao authored May 28, 2024
  
  5bd3c650
28 May, 2024 2 commits
- [BugFix] Fix Embedding Models with TP>1 (#5075) · 9ba41558
  Robert Shaw authored May 28, 2024
  
  9ba41558
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
22 May, 2024 2 commits
- [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) · eb6d3c26
  Nick Hill authored May 22, 2024
  
  eb6d3c26
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
```
  a3a73ab0
18 May, 2024 1 commit
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024

  Currently, the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.

  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

  2e9a2227
16 May, 2024 4 commits
- [Misc] remove old comments (#4866) · 10fa9eea
  youkaichao authored May 16, 2024
  
  10fa9eea
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
  
  e0818808
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 1 commit
- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1
  SangBin Cho authored May 15, 2024

  [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

  This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows it to be sliced when running the attention backend.

  It also refactors subquery_start_loc, which was not refactored in the previous PR.

  65bf2ac1
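
The idea described in 65bf2ac1 (one preparation call, one metadata object that can be sliced into a prefill part and a decode part) can be illustrated with a small sketch. This is not vLLM's code: `BatchMetadata`, `prepare_batch`, `prefill_slice`, and `decode_slice` are hypothetical names, and the layout (prefill sequences first, then decodes, one new token per decode) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchMetadata:
    """Hypothetical coalesced metadata for one mixed batch.

    Assumed layout: prefill sequences come first, decode sequences after.
    """
    seq_lens: List[int]   # tokens processed in this step, per sequence
    num_prefills: int     # the first num_prefills entries are prefills

    @property
    def num_prefill_tokens(self) -> int:
        # Total tokens contributed by the prefill portion of the batch.
        return sum(self.seq_lens[:self.num_prefills])

    def prefill_slice(self) -> "BatchMetadata":
        # View containing only the prefill sequences.
        return BatchMetadata(self.seq_lens[:self.num_prefills],
                             self.num_prefills)

    def decode_slice(self) -> "BatchMetadata":
        # View containing only the decode sequences.
        return BatchMetadata(self.seq_lens[self.num_prefills:], 0)


def prepare_batch(prompt_lens: List[int], num_decodes: int) -> BatchMetadata:
    """Single entry point replacing separate prefill/decode preparation.

    Each decode sequence is assumed to contribute exactly one new token.
    """
    seq_lens = list(prompt_lens) + [1] * num_decodes
    return BatchMetadata(seq_lens, len(prompt_lens))


meta = prepare_batch([3, 5], num_decodes=2)
prefill_part = meta.prefill_slice()
decode_part = meta.decode_slice()
```

In this sketch an attention backend would receive `meta` once and take `prefill_part` or `decode_part` as needed, instead of being handed two separately built metadata objects.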