Commits · dfbe60dc62409f03aa9eebc70ab2582ae64f0e1f · OpenDAS / vllm_cscc

02 Jun, 2024 2 commits
- [Misc] Simplify code and fix type annotations in `conftest.py` (#5118) · dfbe60dc
  Cyrus Leung authored Jun 03, 2024
  
  dfbe60dc
- Update test_ignore_eos (#4898) · ed59a7ed
  Simon Mo authored Jun 01, 2024
  
  ed59a7ed
01 Jun, 2024 3 commits
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  f081c3ce
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
31 May, 2024 1 commit
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  a22dea54
30 May, 2024 1 commit
- [BUGFIX] [FRONTEND] Correct chat logprobs (#5029) · 87d41c84
  Breno Faria authored May 30, 2024
```
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
  87d41c84
29 May, 2024 6 commits
- [Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099) · b1c25563
  Cyrus Leung authored May 30, 2024
  
  b1c25563
- [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096) · eecd8643
  Cyrus Leung authored May 30, 2024
  
  eecd8643
- [Core] Cross-attention KV caching and memory-management (towards eventual... · 4238bc82
  afeldman-nm authored May 29, 2024
```
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837)
```
  4238bc82
- [Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092) · 18c1f16d
  Cyrus Leung authored May 29, 2024
  
  18c1f16d
- [Core][Optimization] remove vllm-nccl (#5091) · 5bd3c650
  youkaichao authored May 28, 2024
  
  5bd3c650
- [Bugfix] Remove the last EOS token unless explicitly specified (#5077) · dfba529b
  Junichi Sato authored May 29, 2024
  
  dfba529b
28 May, 2024 2 commits
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  5ae5ed1e
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
27 May, 2024 1 commit
- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024

  Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

  1102bef2
25 May, 2024 2 commits
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) · d5a16977
  Lily Liu authored May 25, 2024
  
  d5a16977
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024
```
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  8e192ff9
24 May, 2024 2 commits
- [Core][Bugfix]: fix prefix caching for blockv2 (#4764) · e64fde4b
  leiwen83 authored May 25, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  e64fde4b
- [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) · 91977095
  Robert Shaw authored May 24, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  91977095
23 May, 2024 3 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- [Core][1/N] Support send/recv in PyNCCL Groups (#4988) · 5eda2ea0
  Murali Andoorveedu authored May 23, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
  5eda2ea0
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 6 commits
- [Misc] Take user preference in attention selector (#4960) · ee3eea0a
  Cody Yu authored May 22, 2024
  
  ee3eea0a
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
```
  a3a73ab0
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```
  8674f988
- [misc] remove comments that were supposed to be removed (#4977) · c74c913b
  SangBin Cho authored May 22, 2024
  
  c74c913b
- [Frontend] Dynamic RoPE scaling (#4638) · 9b9a10d6
  sasha0552 authored May 22, 2024
  
  9b9a10d6
21 May, 2024 1 commit
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
  
  f12c3b5b
20 May, 2024 2 commits
- [Build/CI] Enabling AMD Entrypoints Test (#4834) · 943e72ca
  Alexei-V-Ivanov-AMD authored May 20, 2024
```
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
```
  943e72ca
- [Kernel] Add flash-attn back (#4907) · b57e6c59
  Woosuk Kwon authored May 19, 2024
  
  b57e6c59
19 May, 2024 2 commits
- [Kernel] Add marlin_24 unit tests (#4901) · 27ce8547
  Alexander Matveev authored May 19, 2024
  
  27ce8547
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
  f68470e8
18 May, 2024 1 commit
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024

  Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

  2e9a2227
17 May, 2024 2 commits
- [Bugfix] fix rope error when load models with different dtypes (#4835) · 33e0823d
  Jinzhen Lin authored May 17, 2024
  
  33e0823d
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness,... · 26148120
  Alexei-V-Ivanov-AMD authored May 16, 2024
```
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
```
  26148120
16 May, 2024 3 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850) · 8435b207
  Silencio authored May 17, 2024
```
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
```
  8435b207
- [Core][Distributed] remove graph mode function (#4818) · e0818808
  youkaichao authored May 16, 2024
  
  e0818808