Commits · 8ebc32aa4fcadbbcd5beac3695e96e0eda271e05 · OpenDAS / vllm_cscc

04 Aug, 2024 1 commit
- add llama model awq support · 5f5ddc3d
  gaoqiong authored Aug 04, 2024
  
  5f5ddc3d
09 Jul, 2024 1 commit
- Support Deepseek-V2 (#4650) · b1b95055
  huangwb authored Jul 09, 2024
  
  b1b95055
11 Jun, 2024 3 commits
- [Core][Doc] Default to multiprocessing for single-node distributed case (#5230) · 99dac099
  Nick Hill authored Jun 11, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  99dac099
- [Frontend] Customizable RoPE theta (#5197) · dcbf4286
  sasha0552 authored Jun 11, 2024
  
  dcbf4286
- [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) · 351d5e7b
  maor-ps authored Jun 11, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  351d5e7b
10 Jun, 2024 1 commit
- [Misc] Update to comply with the new `compressed-tensors` config (#5350) · 5884c2b4
  Dipika Sikka authored Jun 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  5884c2b4
07 Jun, 2024 1 commit
- [Frontend] Add OpenAI Vision API Support (#5237) · 7a9cb294
  Roger Wang authored Jun 07, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  7a9cb294
06 Jun, 2024 2 commits
- Bugfix: fix broken of download models from modelscope (#5233) · 4efff036
  liuyhwangyh authored Jun 07, 2024
```
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
```
  4efff036
- [CI/Build] Update vision tests (#5307) · 89c92078
  Cyrus Leung authored Jun 06, 2024
  
  89c92078
05 Jun, 2024 1 commit
- [BugFix] Fix log message about default max model length (#5284) · 3d33e372
  Nick Hill authored Jun 05, 2024
  
  3d33e372
03 Jun, 2024 2 commits
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
  10c38e3e
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
30 May, 2024 1 commit
- [Bugfix] Automatically Detect SparseML models (#5119) · d910816c
  Robert Shaw authored May 30, 2024
  
  d910816c
27 May, 2024 1 commit

- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024
  
  Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
  
  1102bef2

22 May, 2024 2 commits
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
```
  a3a73ab0
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
  
  9b9a10d6
18 May, 2024 1 commit

- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024
  
  Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.
  
  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
  
  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
  
  2e9a2227

17 May, 2024 1 commit
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) · 26148120
  Alexei-V-Ivanov-AMD authored May 16, 2024
  
  26148120
16 May, 2024 2 commits
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  30e75439
15 May, 2024 1 commit
- [Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) · a5675d34
  zifeitong authored May 15, 2024
  
  a5675d34
14 May, 2024 1 commit
- [Core] Add MultiprocessingGPUExecutor (#4539) · 676a9998
  Nick Hill authored May 14, 2024
```
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
```
  676a9998
13 May, 2024 1 commit
- [Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
  Cody Yu authored May 13, 2024
  
  ce532ff4
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
09 May, 2024 1 commit
- [Bugfix] Add logs for all model dtype casting (#4717) · be0c5180
  Michael Goin authored May 09, 2024
  
  be0c5180
08 May, 2024 1 commit
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  f942efb5
05 May, 2024 1 commit
- Disable cuda version check in vllm-openai image (#4530) · 0650e593
  zhaoyang-star authored May 06, 2024
  
  0650e593
04 May, 2024 2 commits
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
  43029870
- [Doc] Chunked Prefill Documentation (#4580) · 36fb68f9
  SangBin Cho authored May 04, 2024
  
  36fb68f9
03 May, 2024 2 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
02 May, 2024 1 commit
- [Misc] centralize all usage of environment variables (#4548) · 5b8a7c1c
  youkaichao authored May 02, 2024
  
  5b8a7c1c
01 May, 2024 2 commits
- [Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  b38e42fb
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173) · a88bb9b0
  AnyISalIn authored May 02, 2024
```
Signed-off-by: AnyISalIn <anyisalin@gmail.com>
```
  a88bb9b0
29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  73c8d677
27 Apr, 2024 1 commit
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  eefeb164
26 Apr, 2024 1 commit
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
  a88081bf
25 Apr, 2024 1 commit
- [Model] Adds Phi-3 support (#4298) · 96e90fde
  Caio Mendes authored Apr 25, 2024
  
  96e90fde
23 Apr, 2024 1 commit
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) · 62b8aebc
  Cade Daniel authored Apr 23, 2024
  
  62b8aebc