Commits · 9ff4511e43bb95efefd4e28048ca257e408277fb · OpenDAS / vllm_cscc

30 Oct, 2024 1 commit
- [Misc] Add chunked-prefill support on FlashInfer. (#9781) · 9ff4511e
  Elfie Guo authored Oct 30, 2024
  
  9ff4511e
21 Oct, 2024 1 commit
- [Doc] Consistent naming of attention backends (#9498) · 496e991d
  Thomas Parnell authored Oct 21, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  496e991d
19 Oct, 2024 1 commit
- [Kernel] Add env variable to force flashinfer backend to enable tensor cores (#9497) · 0c9a5258
  Thomas Parnell authored Oct 19, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  0c9a5258
17 Oct, 2024 1 commit
- [Core] Deprecating block manager v1 and make block manager v2 default (#8704) · 81ede99c
  Kuntai Du authored Oct 17, 2024
```
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
```
  81ede99c
06 Oct, 2024 1 commit
- [core] use forward context for flash infer (#9097) · f4dd830e
  youkaichao authored Oct 05, 2024
  
  f4dd830e
03 Oct, 2024 1 commit
- [misc] add forward context for attention (#9029) · 9aaf14c6
  youkaichao authored Oct 03, 2024
  
  9aaf14c6
01 Oct, 2024 1 commit
- [Spec Decode] (1/2) Remove batch expansion (#8839) · 15702038
  Lily Liu authored Oct 01, 2024
  
  15702038
27 Sep, 2024 2 commits
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) · c2ec430a
  Varun Sundar Rabindranath authored Sep 27, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  c2ec430a
- [torch.compile] use empty tensor instead of None for profiling (#8875) · a9b15c60
  youkaichao authored Sep 27, 2024
  
  a9b15c60
17 Sep, 2024 1 commit
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) · 1009e93c
  sroy745 authored Sep 17, 2024
  
  1009e93c
12 Sep, 2024 2 commits
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) · 01987725
  Alexander Matveev authored Sep 12, 2024
  
  01987725
- [multi-step] add flashinfer backend (#7928) · a6c0f365
  William Lin authored Sep 12, 2024
  
  a6c0f365
05 Sep, 2024 1 commit
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) · e39ebf5c
  Elfie Guo authored Sep 04, 2024
  
  e39ebf5c
31 Aug, 2024 1 commit
- [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) · 622f8abf
  Pavani Majety authored Aug 30, 2024
  
  622f8abf
29 Aug, 2024 2 commits
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for... · 6b342156
  Pavani Majety authored Aug 29, 2024
```
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend.  + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  6b342156
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) · ef99a787
  youkaichao authored Aug 28, 2024
  
  ef99a787
28 Aug, 2024 1 commit
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) · b98cc28f
  Pavani Majety authored Aug 28, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  b98cc28f
27 Aug, 2024 1 commit
- Revert #7509 (#7887) · 9606c719
  Cody Yu authored Aug 27, 2024
  
  9606c719
21 Aug, 2024 1 commit
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention... · 53328d75
  LI MOU authored Aug 21, 2024
```
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)
```
  53328d75
20 Aug, 2024 1 commit
- [Core] Add `AttentionState` abstraction (#7663) · 3b682179
  Antoni Baum authored Aug 20, 2024
  
  3b682179
16 Aug, 2024 1 commit
- register custom op for flash attn and use from torch.ops (#7536) · 54bd9a03
  youkaichao authored Aug 15, 2024
  
  54bd9a03
12 Aug, 2024 1 commit
- [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) · ec2affa8
  Lily Liu authored Aug 12, 2024
  
  ec2affa8
07 Aug, 2024 1 commit
- [Kernel] Fix Flashinfer Correctness (#7284) · e53dfd3e
  Lily Liu authored Aug 07, 2024
  
  e53dfd3e
05 Aug, 2024 1 commit
- [MISC] Use non-blocking transfer in prepare_input (#7172) · ef527be0
  Cody Yu authored Aug 05, 2024
  
  ef527be0
02 Aug, 2024 1 commit
- [Kernel] Fix input for flashinfer prefill wrapper. (#7008) · 954f7305
  Lily Liu authored Aug 01, 2024
  
  954f7305
01 Aug, 2024 1 commit
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
  
  805a8a75
25 Jul, 2024 1 commit
- [Bugfix] Fix decode tokens w. CUDA graph (#6757) · 309aaef8
  Cody Yu authored Jul 24, 2024
  
  309aaef8
24 Jul, 2024 2 commits
- [Core] Tweaks to model runner/input builder developer APIs (#6712) · 5448f676
  Antoni Baum authored Jul 24, 2024
  
  5448f676
- Add fp8 support to `reshape_and_cache_flash` (#6667) · 0e63494c
  Antoni Baum authored Jul 24, 2024
  
  0e63494c
23 Jul, 2024 1 commit
- [Core] Modulize prepare input and attention metadata builder (#6596) · e0c15758
  Cody Yu authored Jul 22, 2024
  
  e0c15758
20 Jul, 2024 1 commit
- [Misc] Consolidate and optimize logic for building padded tensors (#6541) · 9042d683
  Cyrus Leung authored Jul 20, 2024
  
  9042d683
18 Jul, 2024 1 commit
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2... · c8a7d51c
  Noam Gat authored Jul 18, 2024
```
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501)
```
  c8a7d51c
17 Jul, 2024 1 commit
- [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) · 2fa4623d
  Cody Yu authored Jul 17, 2024
  
  2fa4623d
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
08 Jul, 2024 1 commit
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention... · 543aa485
  afeldman-nm authored Jul 08, 2024
```
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  543aa485
04 Jul, 2024 1 commit
- [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051) · 69ec3ca1
  Lily Liu authored Jul 04, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  69ec3ca1
01 Jul, 2024 1 commit
- [Bugfix] Add explicit `end_forward` calls to flashinfer (#6044) · c4059ea5
  Antoni Baum authored Jul 01, 2024
  
  c4059ea5
28 Jun, 2024 1 commit
- [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628) · 7041de43
  Lily Liu authored Jun 28, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
```
  7041de43
26 Jun, 2024 1 commit
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) · dda48115
  Stephanie Wang authored Jun 25, 2024
```
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
```
  dda48115
22 May, 2024 1 commit
- [Misc] Take user preference in attention selector (#4960) · ee3eea0a
  Cody Yu authored May 22, 2024
  
  ee3eea0a