Commits · ba262c4e5aa9fa753c8cedfaea5c42941184a0db · OpenDAS / vllm_cscc

31 Aug, 2024 1 commit
- [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) · 622f8abf
  Pavani Majety authored Aug 30, 2024
  
  622f8abf
29 Aug, 2024 2 commits
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for... · 6b342156
  Pavani Majety authored Aug 29, 2024
```
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend.  + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  6b342156
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) · ef99a787
  youkaichao authored Aug 28, 2024
  
  ef99a787
28 Aug, 2024 1 commit
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) · b98cc28f
  Pavani Majety authored Aug 28, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  b98cc28f
27 Aug, 2024 1 commit
- Revert #7509 (#7887) · 9606c719
  Cody Yu authored Aug 27, 2024
  
  9606c719
21 Aug, 2024 1 commit
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention... · 53328d75
  LI MOU authored Aug 21, 2024
```
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)
```
  53328d75
20 Aug, 2024 1 commit
- [Core] Add `AttentionState` abstraction (#7663) · 3b682179
  Antoni Baum authored Aug 20, 2024
  
  3b682179
16 Aug, 2024 1 commit
- register custom op for flash attn and use from torch.ops (#7536) · 54bd9a03
  youkaichao authored Aug 15, 2024
  
  54bd9a03
12 Aug, 2024 1 commit
- [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) · ec2affa8
  Lily Liu authored Aug 12, 2024
  
  ec2affa8
07 Aug, 2024 1 commit
- [Kernel] Fix Flashinfer Correctness (#7284) · e53dfd3e
  Lily Liu authored Aug 07, 2024
  
  e53dfd3e
05 Aug, 2024 1 commit
- [MISC] Use non-blocking transfer in prepare_input (#7172) · ef527be0
  Cody Yu authored Aug 05, 2024
  
  ef527be0
02 Aug, 2024 1 commit
- [Kernel] Fix input for flashinfer prefill wrapper. (#7008) · 954f7305
  Lily Liu authored Aug 01, 2024
  
  954f7305
01 Aug, 2024 1 commit
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
  
  805a8a75
25 Jul, 2024 1 commit
- [Bugfix] Fix decode tokens w. CUDA graph (#6757) · 309aaef8
  Cody Yu authored Jul 24, 2024
  
  309aaef8
24 Jul, 2024 2 commits
- [Core] Tweaks to model runner/input builder developer APIs (#6712) · 5448f676
  Antoni Baum authored Jul 24, 2024
  
  5448f676
- Add fp8 support to `reshape_and_cache_flash` (#6667) · 0e63494c
  Antoni Baum authored Jul 24, 2024
  
  0e63494c
23 Jul, 2024 1 commit
- [Core] Modulize prepare input and attention metadata builder (#6596) · e0c15758
  Cody Yu authored Jul 22, 2024
  
  e0c15758
20 Jul, 2024 1 commit
- [Misc] Consolidate and optimize logic for building padded tensors (#6541) · 9042d683
  Cyrus Leung authored Jul 20, 2024
  
  9042d683
18 Jul, 2024 1 commit
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2... · c8a7d51c
  Noam Gat authored Jul 18, 2024
```
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501)
```
  c8a7d51c
17 Jul, 2024 1 commit
- [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) · 2fa4623d
  Cody Yu authored Jul 17, 2024
  
  2fa4623d
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
08 Jul, 2024 1 commit
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention... · 543aa485
  afeldman-nm authored Jul 08, 2024

  [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

  543aa485
04 Jul, 2024 1 commit
- [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051) · 69ec3ca1
  Lily Liu authored Jul 04, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  69ec3ca1
01 Jul, 2024 1 commit
- [Bugfix] Add explicit `end_forward` calls to flashinfer (#6044) · c4059ea5
  Antoni Baum authored Jul 01, 2024
  
  c4059ea5
28 Jun, 2024 1 commit
- [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628) · 7041de43
  Lily Liu authored Jun 28, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
```
  7041de43
26 Jun, 2024 1 commit
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) · dda48115
  Stephanie Wang authored Jun 25, 2024

  Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
  Signed-off-by: Stephanie <swang@anyscale.com>
  Co-authored-by: Stephanie <swang@anyscale.com>

  dda48115
22 May, 2024 1 commit
- [Misc] Take user preference in attention selector (#4960) · ee3eea0a
  Cody Yu authored May 22, 2024
  
  ee3eea0a
16 May, 2024 1 commit
- [Bugfix] Fix FP8 KV cache support (#4869) · 9a31a817
  Woosuk Kwon authored May 16, 2024
  
  9a31a817
15 May, 2024 1 commit
- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode... · 65bf2ac1
  SangBin Cho authored May 15, 2024

  [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)

  This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend.

  It also refactors subquery_start_loc, which was not refactored in the previous PR.

  65bf2ac1
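
The idea described in the 65bf2ac1 message (one preparation call, one metadata object that is sliced into prefill and decode parts) might look roughly like the sketch below. This is an illustration only, not the code from the PR: `AttnMetadataSketch`, `prepare_batch`, and every field name are hypothetical, and the layout (prefill tokens first, decode tokens after) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttnMetadataSketch:
    """Hypothetical combined metadata for one batch (not the real vLLM class)."""
    num_prefill_tokens: int
    num_decode_tokens: int
    # One cache slot per token; assumed layout: prefill tokens, then decode tokens.
    slot_mapping: List[int]

    def prefill_slice(self) -> Optional["AttnMetadataSketch"]:
        """Return metadata covering only the prefill tokens, or None if there are none."""
        if self.num_prefill_tokens == 0:
            return None
        n = self.num_prefill_tokens
        return AttnMetadataSketch(n, 0, self.slot_mapping[:n])

    def decode_slice(self) -> Optional["AttnMetadataSketch"]:
        """Return metadata covering only the decode tokens, or None if there are none."""
        if self.num_decode_tokens == 0:
            return None
        n = self.num_prefill_tokens
        return AttnMetadataSketch(0, self.num_decode_tokens, self.slot_mapping[n:])


def prepare_batch(prompt_token_counts: List[int], num_decode_seqs: int) -> AttnMetadataSketch:
    """Single hypothetical entry point standing in for separate prompt/decode preparation."""
    num_prefill = sum(prompt_token_counts)
    # Each decoding sequence contributes exactly one new token in this sketch.
    total = num_prefill + num_decode_seqs
    return AttnMetadataSketch(num_prefill, num_decode_seqs, list(range(total)))
```

In this sketch an attention backend would call `prefill_slice()` and `decode_slice()` on the same object and dispatch each non-`None` part to the matching kernel.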

13 May, 2024 1 commit
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
08 May, 2024 3 commits
- [Misc] Use vllm-flash-attn instead of flash-attn (#4686) · 89579a20
  Woosuk Kwon authored May 08, 2024
  
  89579a20
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Misc] Add `get_name` method to attention backends (#4685) · 5510cf0e
  Woosuk Kwon authored May 08, 2024
  
  5510cf0e
07 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
03 May, 2024 1 commit
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec