Commits · 7b0a0dfb22907505441f8a4a5eb882cbca4d2acf · OpenDAS / vllm_cscc

05 Jun, 2024 4 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
- [Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252) · faf71bcd
  Nick Hill authored Jun 05, 2024
  
- [Misc] Add CustomOp interface for device portability (#5255) · 41ca62cf
  Woosuk Kwon authored Jun 05, 2024
  
- [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) · 974fc9b8
  zifeitong authored Jun 04, 2024
  
04 Jun, 2024 4 commits
- [CI/Build] Simplify model loading for `HfRunner` (#5251) · 9ba093b4
  Cyrus Leung authored Jun 05, 2024
  
- [CI/Build] Add inputs tests (#5215) · ec784b25
  Cyrus Leung authored Jun 04, 2024
  
- [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env... · f42a006b
  afeldman-nm authored Jun 03, 2024
```
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210)
```
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
  
03 Jun, 2024 5 commits
- [FRONTEND] OpenAI `tools` support named functions (#5032) · f775a07e
  Breno Faria authored Jun 04, 2024
  
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
- [CI/BUILD] enable intel queue for longer CPU tests (#4113) · cafb8e06
  Yuan authored Jun 04, 2024
  
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
02 Jun, 2024 2 commits
- [Misc] Simplify code and fix type annotations in `conftest.py` (#5118) · dfbe60dc
  Cyrus Leung authored Jun 03, 2024
  
- Update test_ignore_eos (#4898) · ed59a7ed
  Simon Mo authored Jun 01, 2024
  
01 Jun, 2024 3 commits
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
31 May, 2024 1 commit
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
30 May, 2024 1 commit
- [BUGFIX] [FRONTEND] Correct chat logprobs (#5029) · 87d41c84
  Breno Faria authored May 30, 2024
```
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
29 May, 2024 6 commits
- [Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099) · b1c25563
  Cyrus Leung authored May 30, 2024
  
- [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096) · eecd8643
  Cyrus Leung authored May 30, 2024
  
- [Core] Cross-attention KV caching and memory-management (towards eventual... · 4238bc82
  afeldman-nm authored May 29, 2024
```
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837)
```
- [Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092) · 18c1f16d
  Cyrus Leung authored May 29, 2024
  
- [Core][Optimization] remove vllm-nccl (#5091) · 5bd3c650
  youkaichao authored May 28, 2024
  
- [Bugfix] Remove the last EOS token unless explicitly specified (#5077) · dfba529b
  Junichi Sato authored May 29, 2024
  
28 May, 2024 2 commits
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
27 May, 2024 1 commit

- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024

  Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

25 May, 2024 2 commits
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) · d5a16977
  Lily Liu authored May 25, 2024
  
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024
```
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
24 May, 2024 2 commits
- [Core][Bugfix]: fix prefix caching for blockv2 (#4764) · e64fde4b
  leiwen83 authored May 25, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
- [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) · 91977095
  Robert Shaw authored May 24, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
23 May, 2024 3 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- [Core][1/N] Support send/recv in PyNCCL Groups (#4988) · 5eda2ea0
  Murali Andoorveedu authored May 23, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
22 May, 2024 4 commits
- [Misc] Take user preference in attention selector (#4960) · ee3eea0a
  Cody Yu authored May 22, 2024
  
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
```
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```