Commits · 87d41c849d2cde9279fb08a3a0d97123e3d8fe2f · OpenDAS / vllm_cscc

27 May, 2024 1 commit

- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024


Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>


22 May, 2024 2 commits
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a .kv_scale parameter).
```
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
  
18 May, 2024 1 commit

- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024

Currently the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This change adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
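
The idea of one cos-sin cache per scaling factor, looked up in a single batched pass, can be sketched in plain Python. This is only an illustration and not the kernel added by the PR: the function names, the linear position scaling, and the rotate-half layout are all assumptions made for the example.

```python
import math

def build_cos_sin_cache(rot_dim, max_pos, base, scaling_factor):
    """One table for one scaling factor (assumed linear scaling: position / factor)."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for p in range(max_pos):
        angles = [(p / scaling_factor) * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return cache

def build_batched_cache(rot_dim, max_pos, base, scaling_factors):
    """Concatenate one table per scaling factor and record where each one starts."""
    table, offsets = [], {}
    for sf in scaling_factors:
        offsets[sf] = len(table)
        table.extend(build_cos_sin_cache(rot_dim, int(max_pos * sf), base, sf))
    return table, offsets

def rotate(vec, cos, sin):
    """Rotate-half rotation of a single head vector (hypothetical layout choice)."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second

def batched_rope(vectors, positions, token_offsets, table):
    """One pass over a mixed batch: each token reads the shared table at offset + position."""
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = table[off + pos]
        out.append(rotate(vec, cos, sin))
    return out

# Two tokens that belong to different (hypothetical) LoRAs with scaling factors 1.0 and 4.0.
table, offsets = build_batched_cache(rot_dim=4, max_pos=8, base=10000.0,
                                     scaling_factors=[1.0, 4.0])
rotated = batched_rope(
    vectors=[[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]],
    positions=[1, 4],
    token_offsets=[offsets[1.0], offsets[4.0]],
    table=table,
)
```

Because every token carries its own offset into the concatenated table, requests that use different scaling factors can share one call instead of one call per LoRA.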


17 May, 2024 1 commit
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness,... · 26148120
  Alexei-V-Ivanov-AMD authored May 16, 2024
```
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
```
16 May, 2024 2 commits
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
15 May, 2024 1 commit
- [Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34
  zifeitong authored May 15, 2024
  
14 May, 2024 1 commit
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
13 May, 2024 1 commit
- [Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
  Cody Yu authored May 13, 2024
  
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
09 May, 2024 1 commit
- [Bugfix] Add logs for all model dtype casting (#4717) · be0c5180
  Michael Goin authored May 09, 2024
  
08 May, 2024 1 commit
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
05 May, 2024 1 commit
- Disable cuda version check in vllm-openai image (#4530) · 0650e593
  zhaoyang-star authored May 06, 2024
  
04 May, 2024 2 commits
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
- [Doc] Chunked Prefill Documentation (#4580) · 36fb68f9
  SangBin Cho authored May 04, 2024
  
03 May, 2024 2 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
02 May, 2024 1 commit
- [Misc] centralize all usage of environment variables (#4548) · 5b8a7c1c
  youkaichao authored May 02, 2024
  
01 May, 2024 2 commits
- [Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain... · a88bb9b0
  AnyISalIn authored May 02, 2024
```
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173)
Signed-off-by: AnyISalIn <anyisalin@gmail.com>
```
29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
27 Apr, 2024 1 commit
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
26 Apr, 2024 1 commit
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
25 Apr, 2024 1 commit
- [Model] Adds Phi-3 support (#4298) · 96e90fde
  Caio Mendes authored Apr 25, 2024
  
23 Apr, 2024 1 commit
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) · 62b8aebc
  Cade Daniel authored Apr 23, 2024
  
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
18 Apr, 2024 1 commit
- [Bugfix] Get available quantization methods from quantization registry (#4098) · 53b018ed
  Michael Goin authored Apr 18, 2024
  
16 Apr, 2024 2 commits
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
- LM Format Enforcer Guided Decoding Support (#3868) · 05434764
  Noam Gat authored Apr 16, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
14 Apr, 2024 1 commit
- [Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) · 711a0002
  Sanger Steel authored Apr 13, 2024
  
12 Apr, 2024 2 commits
- [mypy] Add mypy type annotation part 1 (#4006) · 09473ee4
  SangBin Cho authored Apr 13, 2024
  
- [Core] Support LoRA on quantized models (#4012) · 1096717a
  Jee Li authored Apr 12, 2024
  
11 Apr, 2024 1 commit
- [Core][5/N] Fully working chunked prefill e2e (#3884) · 67b4221a
  SangBin Cho authored Apr 11, 2024
  
10 Apr, 2024 1 commit
- [Bugfix] handle hf_config with architectures == None (#3982) · 934d3662
  Travis Johnson authored Apr 10, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
09 Apr, 2024 1 commit
- [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable... · e7c7067b
  Cade Daniel authored Apr 09, 2024
```
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837)
```
05 Apr, 2024 1 commit
- [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) · 18de8834
  SangBin Cho authored Apr 06, 2024
  
03 Apr, 2024 2 commits

- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5
  Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>


- [Speculative decoding] Adding configuration object for speculative decoding (#3706) · 5757d90e
  Cade Daniel authored Apr 02, 2024
```
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
```

02 Apr, 2024 1 commit

- [Hardware][Intel] Add CPU inference backend (#3634) · 0e3f06fe
  bigPYJ1151 authored Apr 02, 2024


Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
